Text File | 1997-02-28 | 139.6 KB | 3,096 lines |
- HOW TO OPTIMIZE FOR THE PENTIUM PROCESSOR
- *****************************************
- Copyright (c) 1996, 1997 by Agner Fog. Last modified 1997-02-28.
-
- Contents:
-
- 1. introduction
- 2. literature
- 3. debugging and verifying
- 4. memory model
- 5. alignment
- 6. cache
- 7. address generation interlock
- 8. pairing integer instructions
- 9. first time vs. repeated execution
- 10. imperfect pairing
- 11. splitting complex instructions into simpler ones
- 12. jumps and branches
- 13. prefixes
- 14. reducing code size
- 15. scheduling floating point code
- 16. loops
- 17. discussion of special instructions
- 18. using integer instructions to do floating point operations
- 19. using floating point instructions to do integer operations
- 20. list of integer instructions
- 21. list of floating point instructions
- 22. testing code speed
- 23. considerations for other microprocessors
-
-
- 1. INTRODUCTION
- ===============
- This manual describes in detail how to write optimized assembly language
- code, with particular focus on the Pentium(R) microprocessor.
-
- This manual is more comprehensive and exact than other sources of
- information and contains many details not found anywhere else.
- This information will enable you in many cases to calculate exactly how
- many clock cycles a piece of code will take.
-
- The information in this manual is based on my own research. This research
- has so far covered only the Pentium Processor without MMX. The Pentium
- with MMX behaves almost the same. Differences are mentioned where
- appropriate. This manual will be updated later with more information on
- the Pentium with MMX. The PentiumPro and PentiumII processors are very
- different, and will be covered only briefly. I have no plans to do
- research on the PentiumPro and PentiumII processors.
-
- Programming in assembly language is much more difficult than programming
- in a high level language. It is very easy to make bugs, and very hard to
- find them. Now
- you have been warned! It is assumed that the reader is already experienced
- in assembly programming. If not, then please read some books on the subject
- and get some programming experience before you begin to do complicated
- optimations.
-
- The hardware design of the Pentium chip has many features which are
- optimized specifically for some commonly used instructions or instruction
- combinations, rather than using general optimation methods. Consequently,
- the rules for optimizing software for this design are quite complicated
- and have many exceptions, but the possible gain in performance may be
- substantial.
-
- Before you start to convert your code to assembly, make sure that your
- algorithm is optimal. Often you can improve a piece of code much more by
- improving the algorithm than by converting it to assembler code.
-
- Some high level language compilers offer relatively good optimation for
- the Pentium, and further optimation by hand may not be worth the effort
- except for the most critical pieces of code - or for the sport of it.
-
- Please don't send your programming questions to me. I am not going to do your
- homework for you.
-
- Good luck with your hunt for nanoseconds!
-
- PS. I have deliberately misspelled 'optimization' because I think
- 'optimation' sounds better.
-
-
- 2. LITERATURE
- =============
- A lot of useful literature can be downloaded for free from Intel's www
- site or acquired in print or on CD-ROM.
-
- I will not give the URLs here because they change very often.
- You can find the documents you need by using the search facilities at:
- http://www.intel.com/sites/developer/search.htm or follow the links from
- http://announce.com/agner/assem
-
- Some documents are in .PDF format. If you don't have software for viewing
- or printing .PDF files, then you may download the Acrobat file reader from
- www.adobe.com
-
- The most useful document is Intel's application note:
- 'AP-526 Optimizations for Intel's 32-bit Processors'
-
- Beginners may use a funny tutorial named "Optimizing Applications for the
- Pentium Processor"
-
- Optimations for the PentiumPro and PentiumII processors are described in
- 'AP-526 Optimizations for Intel's 32-bit Processors' and in the tutorial
- 'Optimizing for the Pentium(R) Pro Processor Family'.
-
- The use of MMX instructions for optimizing specific applications is
- described in several application notes. The MMX instruction set is
- described in various manuals and tutorials.
-
- Many sources other than Intel also have useful information. These
- sources are listed in the FAQ for the newsgroup comp.lang.asm.x86.
- The shareware editor ASMEDIT has an online help which covers all
- instruction codes etc. It is available from:
- http://www.inf.tu-dresden.de/~ok3/asmedit.html
-
- For other internet resources follow the links from
- http://announce.com/agner/assem/
-
-
-
- 3. DEBUGGING AND VERIFYING
- ==========================
- Debugging assembly code can be quite hard and frustrating, as you probably
- already have discovered. I would recommend that you start by writing the
- piece of code you want to optimize as a subroutine in a high level language.
- Next, write a test program that will test your subroutine thoroughly. Make
- sure the test program goes into all branches and special cases.
-
- When your high level language subroutine works with your test program, then
- you are ready to translate the code to assembly language (some high level
- compilers will do this for you), and test it on the test program.
-
- Then you can start to optimize. Each time you have made a modification you
- should run it on the test program to see if it works correctly.
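- The workflow above can be sketched in C. This is a minimal, hypothetical
- harness: sum_ref is the straightforward reference routine, and sum_opt
- stands in for the hand-optimized (e.g. assembly) version; all names are
- illustrative.

```c
#include <stddef.h>

/* Reference implementation: simple and obviously correct. */
int sum_ref(const int *p, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Stand-in for the optimized routine (here unrolled by two in C,
   where the real thing would be assembly). */
int sum_opt(const int *p, int n)
{
    int s0 = 0, s1 = 0, i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += p[i];
        s1 += p[i + 1];
    }
    if (i < n)                    /* odd-length tail */
        s0 += p[i];
    return s0 + s1;
}

/* Run both versions over several inputs, including the empty case,
   and report whether they agree. */
int versions_agree(void)
{
    int data[] = { 3, -7, 0, 42, 1, 1, -1 };
    for (int n = 0; n <= 7; n++)
        if (sum_ref(data, n) != sum_opt(data, n))
            return 0;
    return 1;
}
```

- Re-running such a check after every modification catches regressions
- before they get buried under further changes.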
-
- Number all your versions and save them so that you can go back and test
- them again in case you discover an error which the test program didn't
- catch (such as writing to a wrong address).
-
- Test the speed of the most critical part of your program with the method
- described in chapter 22. If the code is significantly slower than expected,
- then the most probable reasons are: cache misses (chapter 6), misaligned
- operands (chapter 5), and first time penalty (chapter 9).
-
-
- 4. MEMORY MODEL
- ===============
- The Pentium is designed primarily for 32 bit code, and its performance is
- inferior on 16 bit code. Segmenting your code and data also degrades
- performance significantly, so you should generally prefer 32 bit flat mode,
- and an operating system which supports this mode (e.g. Windows 95, OS/2, or
- a 32 bit DOS extender). The code examples shown in this manual assume a 32
- bit flat memory model, unless otherwise specified.
-
-
- 5. ALIGNMENT
- ============
- All data in RAM should be aligned to addresses divisible by 2, 4, or 8
- according to this scheme:
-
- operand size alignment
- ---------------------------
- 1 (byte) 1
- 2 (word)        2 (or address MOD 4 <> 3; other processors require alignment by 2)
- 4 (dword) 4
- 6 (fword) 4 (Pentium Pro requires alignment by 8)
- 8 (qword) 8
- 10 (tbyte) 8 (Pentium Pro requires alignment by 16)
-
- Misaligned data will take at least 3 clock cycles extra to access.
-
- Aligning code is not necessary when you run on the Pentium, but for optimal
- performance on other processors you may align subroutine entries and loop
- entries by 16.
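- As a sketch of how such alignment can be handled from a high level
- language, here is a hypothetical C helper for rounding an address up to a
- power-of-two boundary (compilers normally align scalar variables for you;
- this is for manually placed buffers):

```c
#include <stdint.h>

/* Round addr up to the next multiple of align.
   align must be a power of two (2, 4, 8, 16, 32, ...). */
uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Nonzero if addr is already aligned to the given boundary. */
int is_aligned(uintptr_t addr, uintptr_t align)
{
    return (addr & (align - 1)) == 0;
}
```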
-
-
- 6. CACHE
- ========
- The Pentium without MMX has 8 kb of on-chip cache (level one cache) for
- code, and 8 kb for data. The Pentium with MMX has 16 kb for code and 16 kb
- for data. Data in the level 1 cache can be read or written to in just one
- clock cycle, whereas a cache miss may cost many clock cycles. It is
- therefore important that you understand how the cache works in order to use
- it most efficiently.
-
- The data cache in the Pentium without MMX consists of 256 lines of 32 bytes
- each. Each time you read a data item which is not cached, the processor will
- read an entire cache line from memory. The cache lines are always aligned to
- a physical address divisible by 32. When you have read a byte at an address
- divisible by 32, then the next 31 bytes can be read or written to at almost
- no extra cost. You can take advantage of this by arranging data items which
- are used near each other together into aligned blocks of 32 bytes of memory.
- If, for example, you have a loop which accesses two arrays, then you may
- combine the two arrays into one array of structures, so that data which are
- used together are also stored together.
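- The array-combining idea can be sketched in C; the struct and function
- names below are purely illustrative:

```c
#include <stddef.h>

/* With two parallel arrays, a[i] and b[i] may sit in different cache
   lines. Combined into one array of structs, the pair of elements used
   together occupies adjacent bytes and normally shares a cache line. */
struct pair { float a, b; };

float dot_combined(const struct pair *p, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p[i].a * p[i].b;   /* .a and .b come from the same line */
    return s;
}
```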
-
- If the size of an array or other data structure is a multiple of 32 bytes,
- then you should preferably align it by 32.
-
- A cache line can not be assigned to an arbitrary memory address. Each cache
- line has a 7-bit set-value which must match bit 5 through 11 of the physical
- RAM address (bit 0-4 define the 32 bytes within a cache line). There are two
- cache lines for each of the 128 set-values, so there are two possible cache
- lines to assign to any RAM address (four for the MMX version). The
- consequence of this is that the cache can hold no more than two different
- data blocks which have the same value in bits 5-11 of the address. You can
- determine if two addresses have the same set-value by the following method:
- Strip off the lower 5 bits of each address to get a value divisible by 32.
- If the difference between the two truncated addresses is a multiple of 4096
- (=1000H), then the addresses have the same set-value.
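- As a sketch, the same check can be written in C (the helper name is
- illustrative):

```c
#include <stdint.h>

/* Nonzero if two data addresses compete for the same set in the
   Pentium (non-MMX) data cache: strip the lower 5 bits (the 32-byte
   line offset); the addresses have the same set-value when the
   truncated difference is a multiple of 4096 (128 sets * 32 bytes). */
int same_set_value(uintptr_t a, uintptr_t b)
{
    uintptr_t ta = a & ~(uintptr_t)31;   /* align down to cache line */
    uintptr_t tb = b & ~(uintptr_t)31;
    uintptr_t diff = ta > tb ? ta - tb : tb - ta;
    return diff % 4096 == 0;
}
```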
-
- Let me illustrate this by the following piece of code, where ESI holds an
- address divisible by 32:
-
- AGAIN: MOV EAX, [ESI]
- MOV EBX, [ESI + 13*4096 + 4]
- MOV ECX, [ESI + 20*4096 + 28]
- DEC EDX
- JNZ AGAIN
-
- The three addresses used here all have the same set-value because the
- differences between the truncated addresses are multiples of 4096. This loop
- will perform very poorly. At the time you read ECX there is no free cache
- line with the proper set-value so the processor takes the least recently
- used of the two possible cache lines, that is the one which was used for EAX,
- and fills it with the data from [ESI + 20*4096] to [ESI + 20*4096 + 31] and
- reads ECX. Next, when reading EAX, you find that the cache line that held
- the value for EAX has now been discarded, so you take the least recently
- used line, which is the one holding the EBX value, and so on. You have
- nothing but cache misses and the loop takes something like 60 clock cycles.
- If the third line is changed to:
-
- MOV ECX, [ESI + 20*4096 + 32]
-
- then we have crossed a 32 byte boundary, so that we do not have the same
- set-value as in the first two lines, and there will be no problem assigning
- a cache line to each of the three addresses. The loop now takes only 3 clock
- cycles (except for the first time) - a very considerable improvement!
-
- It may be very difficult to determine if your data addresses have the same
- set-values, especially if they are scattered around in different segments.
- The best thing you can do to avoid problems of this kind is to keep all data
- used in the critical part of your program within one contiguous block of
- max. 8 kbytes or two contiguous blocks of max. 4 kbytes each (for example
- one block for static data and another block for data on the stack). This
- will make sure that your cache lines are used optimally.
-
- If the critical part of your code accesses big data structures or random
- data addresses, then you may want to keep all frequently used variables
- (counters, pointers, control variables, etc.) within a single contiguous
- block of max 4 kbytes so that you have a complete set of cache lines free
- for accessing random data. Since you probably need stack space anyway for
- subroutine parameters and return addresses, the best thing is to copy all
- frequently used static data to dynamic variables on the stack, and copy
- them back again outside the critical loop if they have been changed.
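- A sketch of this copy-to-stack technique in C (all names are
- illustrative; where the copies actually land depends on the compiler):

```c
/* Static data, possibly far from the stack in memory. */
static int hit_count = 0;

int count_positive(const int *data, int n)
{
    int local_count = hit_count;   /* copy to a stack variable/register */
    for (int i = 0; i < n; i++)
        if (data[i] > 0)
            local_count++;         /* hot loop touches only local data */
    hit_count = local_count;       /* write back once, outside the loop */
    return local_count;
}
```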
-
- Reading a data item which is not in the level one cache causes an entire
- cache line to be filled from the level two cache, which takes approximately
- 200 ns (that is 20 clocks on a 100 MHz system or 40 clocks on a 200 MHz
- system), but the bytes you ask for first are available already after 50-100
- ns. If the data item is not in the level two cache either, then you will
- get a delay of something like 200-300 ns. This delay will be somewhat longer
- if you cross a DRAM page boundary. (The size of a DRAM page is 1 kb for 4 MB
- 72 pin RAM modules, and 2 kb for 16 MB modules).
-
- When you write to an address which is not in the level 1 cache, then the
- value will go right through to the level 2 cache or to the RAM (depending on
- how the level 2 cache is set up). This takes approximately 100 ns. If you
- write eight or more times to the same 32 byte block of memory without also
- reading from it, and the block is not in the level one cache, then it may
- be advantageous to make a dummy read from the block first to load it into a
- cache line. All subsequent writes to the same block will then go to the
- cache instead, which takes only one clock cycle. (This is not needed on a
- PentiumPro, which always loads a cache line on write misses). There is
- sometimes a small penalty for writing repeatedly to the same address
- without reading in between.
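- The dummy-read trick might look like this in C; this is a sketch with
- illustrative names, not a measured implementation:

```c
/* Touch each 32-byte block with a dummy read before a burst of writes,
   so the writes hit the level-1 cache instead of going straight through
   (useful on the Pentium; a PentiumPro loads the line on a write miss
   anyway, making the preload pass unnecessary there). */
void fill_with_preload(unsigned char *dst, int n, unsigned char value)
{
    volatile unsigned char sink = 0;
    for (int i = 0; i < n; i += 32)
        sink = dst[i];             /* dummy read loads the cache line */
    (void)sink;                    /* keep reads from being optimized away */
    for (int i = 0; i < n; i++)
        dst[i] = value;            /* writes now go to the cache */
}

/* Small self-check: fill a 64-byte buffer with 7 and sum it. */
int demo(void)
{
    unsigned char buf[64];
    int sum = 0;
    for (int i = 0; i < 64; i++)
        buf[i] = 0;                /* defined contents for the dummy reads */
    fill_with_preload(buf, 64, 7);
    for (int i = 0; i < 64; i++)
        sum += buf[i];
    return sum;                    /* 64 * 7 = 448 */
}
```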
-
- Another way to reduce the number of write misses is to write 8 bytes at a
- time using FILD / FISTP with qword operands rather than writing 4 bytes at
- a time using integer registers. The extra cost of the slow FILD and FISTP
- instructions is compensated for by the fact that you only have to do half as
- many writes. However, this method is only advantageous on a Pentium, and
- only if the destination is not in the level 1 cache. The method is explained
- in section 19 below.
-
- Temporary data may conveniently be stored on the stack because stack data
- are very likely to be in the cache. You should be aware, however, that if
- you store QWORD data on a DWORD size stack, or DWORD data on a WORD size
- stack, then you have a potential alignment problem.
-
- If the life ranges of two data structures do not overlap, then they may use
- the same RAM area to increase cache efficiency. This is consistent with the
- common practice of allocating space for temporary variables on the stack.
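- In C, such sharing can be made explicit with a union, assuming the two
- temporaries are never live at the same time (names illustrative):

```c
#include <stddef.h>

/* Two scratch areas with non-overlapping life ranges share one block
   of RAM, so they also share the same cache lines. */
union scratch {
    double phase1_buf[16];    /* used only during the computation phase */
    char   phase2_text[128];  /* used only while formatting the output  */
};
```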
-
- Storing temporary data in registers is of course even more efficient. Since
- registers are a scarce resource, you may want to use [ESP] rather than [EBP]
- for addressing data on the stack, in order to free EBP for other purposes.
- Just don't forget that the value of ESP changes every time you do a PUSH or
- POP. (You cannot use ESP under 16-bit Windows because the timer interrupt
- will modify the high word of ESP at unpredictable places in your code.)
-
- The Pentium has a separate 8 kb cache for code, which is similar to the data
- cache. It is important that the critical part of your code (the innermost
- loops) fit in the code cache. Frequently used pieces of code or routines
- which are used together should preferably be stored near each other. Seldom
- used branches or procedures should go to the bottom of your code.
-
-
- 7. ADDRESS GENERATION INTERLOCK (AGI)
- =====================================
- It takes one clock cycle to calculate the address needed by an instruction
- which accesses memory. Normally, this calculation is done at a separate
- stage in the pipeline while the preceding instruction or instruction pair is
- executing. But if the address depends on the result of an instruction
- executing in the preceding clock cycle, then you have to wait an extra clock
- cycle for the address to be calculated. This is called an AGI stall.
-
- Example:
- ADD EBX,4 / MOV EAX,[EBX] ; AGI stall
- The stall in this example can be removed by putting some other instructions
- in between ADD EBX,4 and MOV EAX,[EBX] or by rewriting the code to:
- MOV EAX,[EBX+4] / ADD EBX,4
-
- You can also get an AGI stall with instructions which use ESP implicitly for
- addressing, such as PUSH, POP, CALL, and RET, if ESP has been changed in the
- preceding clock cycle by instructions such as MOV, ADD, or SUB. The Pentium
- has special circuitry to predict the value of ESP after a stack operation so
- that you do not get an AGI delay after changing ESP with PUSH, POP, or CALL.
- You can get an AGI stall after RET only if it has an immediate operand to
- add to ESP.
-
- Examples:
- ADD ESP,4 / POP ESI ; AGI stall
- POP EAX / POP ESI ; no stall, pair
- MOV ESP,EBP / RET ; AGI stall
- CALL L1 / L1: MOV EAX,[ESP+8] ; no stall
- RET / POP EAX ; no stall
- RET 8 / POP EAX ; AGI stall
-
- The LEA instruction is also subject to an AGI stall if it uses a base or
- index register which has been changed in the preceding clock cycle. Example:
- INC ESI / LEA EAX,[EBX+4*ESI] ; AGI stall
-
-
- 8. PAIRING INTEGER INSTRUCTIONS
- ===============================
- The Pentium has two pipelines for executing instructions, called the U-pipe
- and the V-pipe. Under certain conditions it is possible to execute two
- instructions simultaneously, one in the U-pipe and one in the V-pipe. This
- can almost double the speed. It is therefore advantageous to reorder your
- instructions to make them pair.
-
- The following instructions are pairable in either pipe:
- MOV register, memory, or immediate into register or memory
- PUSH register or immediate
- POP register
- LEA, NOP
- INC, DEC, ADD, SUB, CMP, AND, OR, XOR,
- and some forms of TEST (see section 17.3)
-
- The following instructions are pairable in the U-pipe only:
- ADC, SBB
- SHR, SAR, SHL, SAL with immediate count
- ROR, ROL, RCR, RCL with an immediate count of 1
-
- The following instructions can execute in either pipe but are only pairable
- when in the V-pipe: near call, short and near jump, short and near
- conditional jump.
-
- All other integer instructions can execute in the U-pipe only, and are not
- pairable.
-
- Two consecutive instructions will pair when the following conditions are met:
-
- 8.1 The first instruction is pairable in the U-pipe and the second
- instruction is pairable in the V-pipe.
-
- 8.2 The second instruction cannot read or write a register which the first
- instruction writes to. Examples:
- MOV EAX, EBX / MOV ECX, EAX ; read after write, do not pair
- MOV EAX, 1 / MOV EAX, 2 ; write after write, do not pair
- MOV EBX, EAX / MOV EAX, 2 ; write after read, pair OK
- MOV EBX, EAX / MOV ECX, EAX ; read after read, pair OK
- MOV EBX, EAX / INC EAX ; read and write after read, pair OK
-
- 8.3 In rule 8.2 partial registers are treated as full registers. Example:
- MOV AL, BL / MOV AH, 0 ; writes to different parts of the same
- ; register, do not pair
-
- 8.4 Two instructions which both write to parts of the flags register can
- pair despite rule 8.2 and 8.3. Example:
- SHR EAX,4 / INC EBX ; pair OK
-
- 8.5 An instruction which writes to the flags can pair with a conditional
- jump despite rule 8.2. Example:
- CMP EAX, 2 / JA LabelBigger ; pair OK
-
- 8.6 The following instruction combinations can pair despite the fact that
- they both modify the stack pointer:
- PUSH + PUSH, PUSH + CALL, POP + POP
-
- 8.7 There are restrictions on the pairing of instructions with prefix.
- There are several types of prefixes:
- - instructions addressing a non-default segment have a segment prefix.
- - instructions using 16 bit data in 32 bit mode, or 32 bit data in 16 bit
- mode have an operand size prefix.
- - instructions using 32 bit base or index registers in 16 bit mode have
- an address size prefix.
- - repeated string instructions have a repeat prefix.
- - locked instructions have a LOCK prefix.
- - many instructions which were not implemented on the 8086 processor
- have a two byte opcode where the first byte is 0FH. The 0FH byte
- behaves as a prefix on the Pentium without MMX, but not on the Pentium
- with MMX. The most common instructions with 0FH prefix are: MOVZX,
- MOVSX, PUSH FS, POP FS, PUSH GS, POP GS, LFS, LGS, LSS, SETcc, BT,
- BTC, BTR, BTS, BSF, BSR, SHLD, SHRD, and IMUL with two operands and no
- immediate operand.
-
- On the Pentium without MMX, a prefixed instruction can only execute in
- the U-pipe, except for conditional near jumps.
- On the Pentium with MMX, instructions with operand size, address
- size, or 0FH prefix can execute in either pipe, whereas instructions with
- segment, repeat, or lock prefix can only execute in the U-pipe.
-
- 8.8 An instruction which has both a displacement and immediate data is not
- pairable on the Pentium without MMX and only pairable in the U-pipe on
- Pentium with MMX:
- MOV DWORD PTR DS:[1000], 0 ; not pairable or only in U-pipe
- CMP BYTE PTR [EBX+8], 1 ; not pairable or only in U-pipe
- CMP BYTE PTR [EBX], 1 ; pairable
- CMP BYTE PTR [EBX+8], AL ; pairable
- (Another problem with instructions which have both a displacement and
- immediate data on the Pentium with MMX is that such instructions may
- be longer than 7 bytes, which means that only one instruction can be
- decoded per clock cycle, as explained in section 13.)
-
- 8.9 Both instructions must be preloaded and decoded. This is explained in
- section 9.
-
- 8.10 There are special pairing rules for MMX instructions:
- - MMX shift instructions can execute in either pipe but cannot pair with
- other MMX shift instructions
- - MMX multiply instructions can execute in either pipe but cannot pair
- with other MMX multiply instructions. They take 3 clock cycles and
- can be pipelined to a throughput of one multiplication per clock cycle
- in the same way as floating point instructions (see chapter 15).
- - an MMX instruction which accesses memory or integer registers can
- execute only in the U-pipe and cannot pair with a non-MMX instruction.
- For further details, see the Intel manuals.
-
-
-
- 9. FIRST TIME VERSUS REPEATED EXECUTION
- =======================================
- A piece of code usually takes much more time the first time it is executed
- than when it is repeated. The reasons are the following:
-
- 9.1 Loading the code from RAM into the cache takes longer than
- executing it.
-
- 9.2 In some cases decoding the code is the bottleneck. If it takes one clock
- cycle to determine the length of an instruction, then it is not possible
- to decode two instructions per clock cycle, because the processor
- doesn't know where the second instruction begins. The Pentium solves
- this problem by remembering the length of any instruction which has
- remained in the cache since last time it was executed. As a consequence
- of this, a set of instructions will not pair the first time they are
- executed, unless the first of the two instructions is only one byte
- long.
-
- 9.3 Jump instructions will not be in the branch target buffer the first
- time they execute, and therefore are less likely to be predicted
- correctly. See chapter 12.
-
- 9.4 Any data accessed by the code has to be loaded into the cache, which may
- take much more time than executing the instructions. When the code is
- repeated, then the data are more likely to be in the cache.
-
- For these four reasons, a piece of code inside a loop will generally take
- much more time the first time it executes than the subsequent times.
-
- If you have a big loop which doesn't fit into the code cache then you will
- get the penalty of 9.1 and 9.2 all the time because it doesn't run from the
- cache. You should therefore try to reorganize the loop to make it fit into
- the cache.
-
- If you have many jumps and branches inside a loop, then you may get the
- penalty in 9.3 all the time.
-
- Likewise, if a loop repeatedly accesses a data structure too big for
- the data cache, then you will get the penalty of data cache misses all the
- time.
-
-
- 10. IMPERFECT PAIRING
- =====================
- There are situations where the two instructions in a pair will not execute
- simultaneously, or only partially overlap in time. They should still be
- considered a pair, though, because the first instruction executes in the
- U-pipe, and the second in the V-pipe. No subsequent instruction can start
- to execute before both instructions in the imperfect pair have finished.
-
- Imperfect pairing will happen in the following cases:
-
- 10.1 If the second instruction suffers an AGI stall (see chapter 7).
-
- 10.2 Two instructions cannot access the same dword of memory simultaneously.
- The following examples assume that ESI is divisible by 4:
- MOV AL, [ESI] / MOV BL, [ESI+1]
- The two operands are within the same dword, so they cannot execute
- simultaneously. The pair takes 2 clock cycles.
- MOV AL, [ESI+3] / MOV BL, [ESI+4]
- Here the two operands are on each side of a dword boundary, so they
- pair perfectly, and take only one clock cycle.
-
- 10.3 Rule 10.2 is extended to the case where bits 2-4 are the same in the two
- addresses (cache bank conflict). For dword addresses this means that
- the difference between the two addresses should not be divisible by 32.
- Examples:
- MOV [ESI], EAX / MOV [ESI+32000], EBX ; imperfect pairing
- MOV [ESI], EAX / MOV [ESI+32004], EBX ; perfect pairing
-
- Pairable instructions which do not access memory take one clock cycle to
- execute, except for mispredicted jumps. MOV instructions to or from memory
- also take only one clock cycle if the data area is in the cache and properly
- aligned. There is no speed penalty for using complex addressing modes such
- as scaled index registers.
-
- A pairable instruction which reads from memory, does some calculation, and
- stores the result in a register or flags, takes 2 clock cycles.
- (read/modify instructions).
-
- A pairable instruction which reads from memory, does some calculation, and
- writes the result back to the memory, takes 3 clock cycles.
- (read/modify/write instructions).
-
- 10.4 If a read/modify/write instruction is paired with a read/modify or
- read/modify/write instruction, then they will pair imperfectly.
-
- The number of clock cycles used is given in the following table:
-
- | First instruction
- | MOV or read/ read/modify/
- Second instruction | register only modify write
- ----------------------|----------------------------------------------
- MOV or register only | 1 2 3
- read/modify | 2 2 4
- read/modify/write | 3 3 5
- ----------------------|-----------------------------------------------
-
- Example:
- ADD [mem1], EAX / ADD EBX, [mem2] ; 4 clock cycles
- ADD EBX, [mem2] / ADD [mem1], EAX ; 3 clock cycles
-
- 10.5 When two paired instructions both take extra time due to cache misses,
- misalignment, or jump misprediction, then the pair will take more time
- than each instruction, but less than the sum of the two.
-
- In order to avoid imperfect pairing you have to know which instructions go
- into the U-pipe, and which to the V-pipe. You can find this out by looking
- backwards in your code, searching for instructions which are unpairable,
- pairable only in one of the pipes, or cannot pair due to one of the rules
- in chapter 8.
-
- Imperfect pairing can often be avoided by reordering your instructions.
- Example:
-
- L1: MOV EAX,[ESI]
- MOV EBX,[ESI]
- INC ECX
-
- Here the two MOV instructions form an imperfect pair because they both
- access the same memory location, and the sequence takes 3 clock cycles.
- You can improve the code by reordering the instructions so that INC ECX
- pairs with one of the MOV instructions.
-
- L2: MOV EAX,OFFSET A
- XOR EBX,EBX
- INC EBX
- MOV ECX,[EAX]
- JMP L1
-
- The pair INC EBX / MOV ECX,[EAX] is imperfect because the latter
- instruction has an AGI stall. The sequence takes 4 clocks. If you insert
- a NOP or any other instruction so that MOV ECX,[EAX] pairs with JMP L1
- instead, then the sequence takes only 3 clocks.
-
- The next example is in 16 bit mode, assuming that SP is divisible by 4:
-
- L3: PUSH AX
- PUSH BX
- PUSH CX
- PUSH DX
- CALL FUNC
-
- Here the PUSH instructions form two imperfect pairs, because both operands
- in each pair go into the same dword of memory. PUSH BX could possibly pair
- perfectly with PUSH CX (because they go on each side of a dword boundary)
- but it doesn't because it has already been paired with PUSH AX. The sequence
- therefore takes 5 clocks. If you insert a NOP or any other instruction so
- that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the
- sequence will take only 3 clocks. Another way to solve the problem is to
- make sure that SP is not divisible by 4. Knowing whether SP is divisible by
- 4 or not in 16 bit mode can be difficult, so the best way to avoid this
- problem is to use 32 bit mode.
-
-
- 11. SPLITTING COMPLEX INSTRUCTIONS INTO SIMPLER ONES
- ====================================================
- You may split up read/modify and read/modify/write instructions to improve
- pairing. Example:
- ADD [mem1],EAX / ADD [mem2],EBX ; 5 clock cycles
- This code may be split up into a sequence which takes only 3 clock cycles:
- MOV ECX,[mem1] / MOV EDX,[mem2] / ADD ECX,EAX / ADD EDX,EBX
- MOV [mem1],ECX / MOV [mem2],EDX
-
- Likewise you may split up non-pairable instructions into pairable instructions:
- PUSH [mem1] / PUSH [mem2] ; non-pairable
- Split up into:
- MOV EAX,[mem1] / MOV EBX,[mem2] / PUSH EAX / PUSH EBX ; everything pairs
-
- Other examples of non-pairable instructions which may be split up into simpler
- pairable instructions:
- CDQ split into: MOV EDX,EAX / SAR EDX,31
- NOT EAX change to XOR EAX,-1
- NEG EAX split into XOR EAX,-1 / INC EAX
- MOVZX EAX,BYTE PTR [mem] split into XOR EAX,EAX / MOV AL,BYTE PTR [mem]
- JECXZ split into TEST ECX,ECX / JZ
- LOOP split into DEC ECX / JNZ
- XLAT change to MOV AL,[EBX+EAX]
-
- If splitting instructions doesn't improve speed, then you may keep the
- complex or nonpairable instructions in order to reduce code size.
-
-
- 12. JUMPS AND BRANCHES
- ======================
- Note: This chapter describes in detail the branch prediction mechanism of
- the Pentium without MMX. This information is based on research done by me
- and Karki Jitendra Bahadur. Information found in Intel documents and
- elsewhere on this subject is directly misleading, and following the
- advice given in such documents is likely to lead to sub-optimal code.
-
- In the following, I will use the term 'jump instruction' for any
- instruction which can change the instruction pointer, including
- conditional and unconditional, direct and indirect, near and far, jumps,
- calls, and returns.
-
- The Pentium attempts to predict where a jump will go to, and whether a
- conditional jump will be taken or fall through. If the prediction is correct,
- then it can save time by loading the subsequent instructions into the
- pipeline before the jump is executed. If the prediction turns out to be
- wrong, then the pipeline has to be flushed, which will cost a penalty of 3
- to 5 clock cycles. (4 for a conditional jump in the V-pipe, 3 in all other
- cases for the Pentium without MMX. The MMX version has penalties of 5 and
- 4 clocks, respectively.)
-
- When optimizing code, it is important to minimize the number of misprediction
- penalties. This requires a good understanding of how the jump prediction
- works. I am giving a very detailed description of this topic here, because
- this information is not published anywhere else.
-
- Note that the following description applies only to the Pentium without
- MMX. Branch prediction in later processors will only be covered briefly.
-
-
- Branch prediction
- -----------------
- The Pentium uses a 'branch target buffer' (BTB), which can save information
- for up to 256 jump instructions. The BTB is organized like a 4-way set-
- associative cache with 64 entries per way. This means that the BTB can hold
- no more than 4 entries with the same set value. Unlike the data cache, the
- BTB uses a pseudo random replacement algorithm, which means that a new entry
- will not necessarily displace the least recently used entry of the same set-
- value. How the set-value is calculated will be explained later. Each BTB
- entry stores the address of the jump target and a prediction state, which
- can have four different values:
-
- state 0: "strongly not taken"
- state 1: "weakly not taken"
- state 2: "weakly taken"
- state 3: "strongly taken"
-
- A branch instruction is predicted to jump when in state 2 or 3, and to
- fall through when in state 0 or 1. Some documents describe the state
- transitions, not quite correctly, as if the state were incremented when the
- branch is taken and decremented when it falls through (no wrap-around).
- This would provide quite a good prediction, because a branch instruction
- would have to deviate twice from what it does most of the time, before the
- prediction changes.
-
- However, this mechanism has been compromised by the fact that state
- 0 also means 'unused BTB entry'. So a BTB entry in state 0 is the same as
- no BTB entry. This makes sense, because a branch instruction is predicted
- to fall through if it has no BTB entry. This improves the utilization of
- the BTB, because a branch instruction which is seldom taken will most of
- the time not take up any BTB entry.
-
- Now, if a taken jump has no BTB entry, then a new BTB entry will
- be generated, and this new entry will always be set to state 3. This means
- that it is impossible to go from state 0 to state 1 (except for a very
- special case discussed later). From state 0 you can only go to state 3, if
- the branch is taken. If the branch falls through, then it will stay out of
- the BTB.
-
- This is a serious design flaw. By always setting new entries to state
- 3, the designers apparently have given priority to minimizing the first
- time penalty for unconditional jumps and branches often taken, rather than
- improving the prediction of conditional jumps in tight innermost loops,
- which is almost the only place where speed really matters. The consequence
- of this flaw is that a branch instruction which falls through most of the
- time will have up to three times as many mispredictions as a branch
- instruction which is taken most of the time.
-
- You may take this asymmetry into account by organizing your branches so
- that they are taken more often than not.
-
- Consider for example this if-then-else construction:
-
- TEST EAX,EAX
- JZ A
- <branch 1>
- JMP E
- A: <branch 2>
- E:
-
- If branch 1 is executed more often than branch 2, and branch 2 is seldom
- executed twice in succession, then you can reduce the number of branch
- mispredictions by up to a factor of 3 by swapping the two branches so that
- the branch instruction will jump more often than fall through:
-
- TEST EAX,EAX
- JNZ A
- <branch 2>
- JMP E
- A: <branch 1>
- E:
-
- (This is _contrary_ to the recommendations in Intel's tutorial. They don't
- seem to recognize the asymmetry in branch prediction. But it is easy to
- prove that the latter example is superior).
-
- There may be reasons to put the most often executed branch first, however:
- 1. Putting seldom executed branches away in the bottom of your code can
- improve code cache utilization.
- 2. A branch instruction seldom taken will stay out of the BTB most of the
- time, possibly improving BTB utilization.
- 3. The branch instruction will be predicted as not taken if it has been
- flushed out of the BTB by other branch instructions.
- 4. The asymmetry in branch prediction only exists on the Pentium without MMX (?)
-
- These considerations have little weight, however, for small critical loops,
- so I would still recommend organizing branches with a skewed distribution
- so that the branch instruction is taken more often than not, unless branch
- 2 is executed so seldom that misprediction doesn't matter.
-
- Likewise, you should preferably organize loops with the testing branch at
- the bottom, as in this example:
-
- MOV ECX, [N]
- L: MOV [EDI],EAX
- ADD EDI,4
- DEC ECX
- JNZ L
-
- If N is high, then the JNZ instruction here will be taken more often than
- not, and never fall through twice in succession.
-
-
- BTB is looking ahead
- --------------------
- The BTB mechanism counts instruction pairs, rather than single
- instructions, so you have to know how instructions pair in order
- to analyze where a BTB entry is stored. The BTB entry for any jump
- instruction is attached to the address of the U-pipe instruction in the
- preceding instruction pair. (An unpaired instruction counts as one pair).
- Example:
-
- SHR EAX,1
- MOV EBX,[ESI]
- CMP EAX,EBX
- JB L
-
- Here SHR pairs with MOV, and CMP pairs with JB. The BTB entry for
- JB L is thus attached to the address of the SHR EAX,1 instruction. When
- this BTB entry is met, and if it is in state 2 or 3, then the Pentium will
- read the target address from the BTB entry, and load the instructions
- following L into the pipeline. This happens before the branch instruction
- has been decoded, so the Pentium relies solely on the information in the
- BTB when doing this.
-
- You may remember that instructions seldom pair the first time they are
- executed (see paragraph 9.2). If the instructions above are not
- pairing, then the BTB entry should be attached to the address of the CMP
- instruction, and this entry would be wrong on the next execution, when
- instructions are pairing. However, in most cases the Pentium is smart enough
- to not make a BTB entry when there is an unused pairing opportunity, so you
- don't get a BTB entry until the second execution, and hence you won't get a
- prediction until the third execution. (In the rare case, where every second
- instruction is a single-byte instruction, you may get a BTB entry on the
- first execution which becomes invalid in the second execution, but since
- the instruction it is attached to will then go to the V-pipe, it is ignored
- and gives no penalty. A BTB entry is only read if it is attached to the
- address of a U-pipe instruction).
-
- A BTB entry is identified by its set-value which is equal to bits 0-5
- of the address it is attached to. Bits 6-31 are then stored in the BTB as
- a tag. Addresses which are spaced a multiple of 64 bytes apart will have
- the same set-value. You can have no more than four BTB entries with the
- same set-value. If you want to check whether your jump instructions contend
- for the same BTB entries, then you have to compare bits 0-5 of the addresses
- of the U-pipe instructions in the preceding instruction pairs. This is very
- tedious, and I have never heard of anybody doing so. There are no tools
- available to do this job for you, as far as I know.
-
-
- Consecutive branches
- --------------------
- When a jump is mispredicted, then the pipeline gets flushed. If the next
- instruction pair executed also contains a jump instruction, then the Pentium
- won't load its target because it cannot load a new target while the
- pipeline is being flushed. The result is that the second jump instruction
- is predicted to fall through regardless of the state of its BTB entry.
- Therefore, if the second jump is also taken, then you will get another
- penalty. The state of the BTB entry for the second jump instruction does get
- correctly updated, though. If you have a long chain of taken jump
- instructions, and the first jump in the chain is mispredicted, then the
- pipeline will get flushed all the time, and you will get nothing but
- mispredictions until you meet an instruction pair that does not jump. The
- most extreme case of this is a loop which jumps to itself: It will get a
- misprediction penalty for each iteration.
-
- This is not the only problem with consecutive jumps. Another problem is that
- you can have a branch instruction between a BTB entry and the jump
- instruction it belongs to. If the first branch instruction jumps to
- somewhere else, then strange things may happen. Consider this example:
-
- SHR EAX,1
- MOV EBX,[ESI]
- CMP EAX,EBX
- JB L1
- JMP L2
-
- L1: MOV EAX,EBX
- INC EBX
-
- When JB L1 falls through, then you will get a BTB entry for JMP L2
- attached to the address of CMP EAX,EBX. But what will happen when
- JB L1 later is taken? At the time when the BTB entry for JMP L2 is
- read, the processor doesn't know that the next instruction pair does not
- contain a jump instruction, so it will actually predict the instruction
- pair MOV EAX,EBX / INC EBX to jump to L2. The penalty for predicting
- non-jump instructions to jump is 3 clock cycles. The BTB entry for JMP L2
- will get its state decremented, because it is applied to something which
- doesn't jump. If we keep going to L1, then the BTB entry for JMP L2
- will be decremented to state 1 and 0, so that the problem will disappear
- until next time JMP L2 is executed.
-
- The penalty for predicting the non-jumping instructions to jump only occurs
- when the jump to L1 is predicted. If JB L1 jumps but was predicted not to,
- then the pipeline gets flushed and we won't get the false L2 target
- loaded, so in this case we will not see the penalty of predicting the
- non-jumping instructions to jump, but we do get the BTB entry for JMP L2
- decremented.
-
- Suppose, now, that we replace the INC EBX instruction above with another
- jump instruction. This third jump instruction will then use the same
- BTB entry as JMP L2 with the possible penalty of predicting a wrong target,
- (unless it happens to also have L2 as target).
-
- To summarize, consecutive jumps can lead to the following problems:
- - failure to load a jump target when the pipeline is being flushed by a
- preceding mispredicted jump.
- - a BTB entry being mis-applied to non-jumping instructions and predicting
- them to jump.
- - a second consequence of the above is that a mis-applied BTB entry will
- get its state decremented, possibly leading to a later misprediction of
- the jump it belongs to. Even unconditional jumps can be predicted to fall
- through for this reason.
- - two jump instructions may share the same BTB entry, leading to the
- prediction of a wrong target.
-
- All this mess may give you a lot of penalties, so you should definitely
- avoid having an instruction pair containing a jump immediately after another
- poorly predictable jump.
-
- It is time for an illustrative example:
-
- CALL P
- TEST EAX,EAX
- JZ L2
- L1: MOV [EDI],EBX
- ADD EDI,4
- DEC EAX
- JNZ L1
- L2: CALL P
-
- This looks like a quite nice and normal piece of code: A function call,
- a loop which is bypassed when the count is zero, and another function call.
- How many problems can you spot in this program?
-
- First, we may note that the function P is called alternately from two
- different locations. This means that the target for the return from P will
- be changing all the time. Consequently, the return from P will always be
- mispredicted.
-
- Assume, now, that EAX is zero. The jump to L2 will not have its target
- loaded because the mispredicted return caused a pipeline flush. Next, the
- second CALL P will also fail to have its target loaded because JZ L2
- caused a pipeline flush. Here we have the situation where a chain of
- consecutive jumps makes the pipeline flush repeatedly because the first
- jump was mispredicted. The BTB entry for JZ L2 is stored at the address
- of P's return instruction. This BTB entry will now be mis-applied to
- whatever comes after the second CALL P, but that doesn't give a penalty
- because the pipeline is flushed by the mispredicted second return.
-
- Now, let's see what happens if EAX has a nonzero value the next time:
- JZ L2 is always predicted to fall through because of the flush. The second
- CALL P has a BTB entry at the address of TEST EAX,EAX. This entry will
- be mis-applied to the MOV/ADD pair, predicting it to jump to P. This
- causes a flush which prevents JNZ L1 from loading its target. If we have
- been here before, then the second CALL P will have another BTB entry at
- the address of DEC EAX. On the second and third iteration of the loop, this
- entry will also be mis-applied to the MOV/ADD pair, until it has had its
- state decremented to 1 or 0. This will not cause a penalty on the second
- iteration because the flush from JNZ L1 prevents it from loading its false
- target, but on the third iteration it will. The subsequent iterations of the
- loop have no penalties, but when it exits, JNZ L1 is mispredicted. The
- flush would now prevent CALL P from loading its target, were it not for
- the fact that the BTB entry for CALL P has already been destroyed by
- being mis-applied several times.
-
- We can improve this code by putting in some NOP's to separate all consecutive
- jumps:
-
- CALL P
- TEST EAX,EAX
- NOP
- JZ L2
- L1: MOV [EDI],EBX
- ADD EDI,4
- DEC EAX
- JNZ L1
- L2: NOP
- NOP
- CALL P
-
- The extra NOP's cost 2 clock cycles, but they save much more. Furthermore,
- JZ L2 is now moved to the U-pipe which reduces its penalty from 4 to 3
- when mispredicted. The only problem that remains is that the returns from
- P are always mispredicted. This problem can only be solved by replacing the
- call to P by an inline macro (if you have enough code cache).
-
- The lesson to learn from this example is that you should always look
- carefully for consecutive jumps and see if you can save time by inserting
- some NOP's. You should be particularly aware of those situations where
- misprediction is unavoidable, such as loop exits and returns from procedures
- which are called from varying locations. If you have something useful to
- put in instead of the NOP's, then you should of course do so.
-
- Multiway branches (case statements) may be implemented either as a tree of
- branch instructions or as a list of jump addresses. If you choose to use
- a tree of branch instructions, then you have to include some NOP's or other
- instructions to separate the consecutive branches. A list of jump addresses
- may therefore be a better solution on the Pentium. The list of jump
- addresses should be placed in the data segment. Never put data in the code
- segment! On the PentiumPro you may choose the branch tree solution if it
- gives a better prediction.
-
-
- Multiple BTB entries
- --------------------
- While two jump instructions can share the same BTB entry, one jump
- instruction can also have multiple BTB entries if it is jumped to from
- different locations, or if its pairing is changing. Example:
-
- JZ L1
- MOV EAX,1
- L1: MOV EBX,2
- MOV ECX,3
- JC L2
-
- If JZ L1 falls through, then MOV EAX,1 pairs with MOV EBX,2, and
- MOV ECX,3 pairs with JC L2. The BTB entry for JC L2 will be attached
- to the address of MOV EAX,1. When JZ L1 is taken, then MOV EBX,2 will
- pair with MOV ECX,3, JC L2 will be unpaired, and its BTB entry will be
- at MOV EBX,2. So you have two BTB entries for JC L2. The first entry
- is used when the zero flag is clear, the other when set. This may actually
- be an advantage, because it will improve the predictability of the second
- branch instruction if there is a correlation between the two. Assume, for
- example, that this code sequence is preceded by a CMP instruction. Then
- we can be certain that the zero flag and the carry flag will never both be
- set at the same time. We will get no second BTB entry for JC L2 at
- MOV EBX,2 because it always falls through if the zero flag is set, and
- we will never get a misprediction for JC L2 when JZ L1 is taken.
- The situations where you can take advantage of this trick are rare,
- however, because you might as well let L1 go to another branch which
- doesn't test the carry flag at all.
-
-
- Tight loops
- -----------
- In a small loop you will often access the same BTB entry repeatedly with
- small intervals. This never causes a stall. Rather than waiting for a BTB
- entry to be updated, the Pentium somehow bypasses the pipeline and gets the
- resulting state from the last jump before it has been written to the BTB.
- This mechanism is almost transparent to the user, but it does in some cases
- have funny effects: You can see a branch prediction going from state 0 to
- state 1, rather than to state 3, if the zero has not yet been written to the
- BTB. This happens if the loop has no more than four instruction pairs. In
- loops with only two instruction pairs you may sometimes have state 0 for two
- consecutive iterations without going out of the BTB. In such small loops it
- also happens in rare cases that the prediction uses the state resulting from
- two iterations ago, rather than from the last iteration. These funny effects
- will usually not have any negative effects on performance.
-
-
- Avoiding branches
- -----------------
- Sometimes it is possible to obtain the same effect as a branch by ingenious
- manipulation of bits and flags. For example you may calculate the absolute
- value of a signed number without branching:
- MOV EDX,EAX
- SAR EDX,31
- XOR EAX,EDX
- SUB EAX,EDX
-
- The carry flag is particularly useful for this kind of trick.
- Setting carry if a value is zero: CMP [value],1
- Setting carry if a value is not zero: XOR EAX,EAX / CMP EAX,[value]
- Incrementing a counter if carry: ADC EAX,0
- Setting a bit for each time the carry is set: RCL EAX,1
- Generating a bit mask if carry is set: SBB EAX,EAX
-
- This example finds the minimum of two unsigned numbers: if (b < a) a = b;
- SUB EBX,EAX
- SBB ECX,ECX
- AND ECX,EBX
- ADD EAX,ECX
-
- This example chooses between two numbers: if (a != 0) a = b; else a = c;
- CMP EAX,1
- SBB EAX,EAX
- XOR ECX,EBX
- AND EAX,ECX
- XOR EAX,EBX
-
- Whether or not such tricks are worth the extra code depends on how
- predictable a conditional jump would be, whether the extra pairing
- opportunities of the branch-free code can be utilized, and whether there
- are other jumps following immediately after which could suffer the penalties
- of consecutive jumps.
-
-
- Later processors
- ----------------
- The PentiumPro and MMX processors have a more advanced branch prediction
- mechanism which enables them to correctly predict a simple repetitive
- pattern. They do not have the asymmetry flaw that the Pentium without MMX
- has. A conditional jump, which has not been seen before, is predicted
- taken if it goes backwards (e.g. a loop), and not taken if it goes
- forward, on these new processors.
-
- MMX processors have a return stack buffer (RSB) which remembers where
- calls come from, so you will no longer get mispredictions when calling
- a subroutine from varying locations. To make this mechanism work, you
- should always match calls with returns. This means that you should not
- use the return instruction as an indirect jump or use an indirect jump
- as return.
-
- Mispredictions are very expensive on the PentiumPro (10-16 clocks!). You
- should therefore avoid poorly predictable branches, and replace them with
- conditional moves whenever possible.
-
- The new processors also have problems with consecutive jumps, but for
- different reasons. So avoiding consecutive jumps may be desirable on all
- processors.
-
-
- 13. PREFIXES
- ============
- An instruction with one or more prefixes may not be able to execute in the
- V-pipe (see paragraph 8.7), and it may take more than one clock cycle to
- decode. On the Pentium without MMX, the decoding delay is one clock cycle
- for each prefix except for the 0Fh prefix of conditional near jumps.
-
- The Pentium with MMX has no decoding delay for the 0Fh prefix. Segment and repeat
- prefixes take one clock extra to decode. Address and operand size prefixes
- take two clocks extra to decode (one for reading the prefix, and one for
- calculating the length of the instruction).
-
- Address size prefixes can be avoided by using 32 bit mode.
- Segment prefixes can be avoided in 32 bit mode by using a flat memory model.
- Operand size prefixes can be avoided in 32 bit mode by using only 8 bit and
- 32 bit integers.
-
- Where prefixes are unavoidable, the decoding delay may be masked if a
- preceding instruction takes more than one clock cycle to execute. The rule
- for the Pentium without MMX is that any instruction which takes N clock
- cycles to execute (not to decode) can 'overshadow' the decoding delay of N-1
- prefixes in the next two (sometimes three) instructions or instruction
- pairs. In other words, each extra clock cycle that an instruction takes to
- execute can be used to decode one prefix in a later instruction. This
- shadowing effect even extends across a predicted branch. Any instruction
- which takes more than one clock cycle to execute, and any instruction which
- is delayed because of an AGI stall, cache miss, misalignment, or any other
- reason except decoding delay and branch misprediction, has a shadowing
- effect.
-
- The Pentium with MMX has a similar shadowing effect, but the mechanism is
- different. Decoded instructions are stored in a transparent
- first-in-first-out (FIFO) buffer, which can hold up to four instructions.
- As long as there are instructions in the FIFO buffer you get no delay.
- When the buffer is empty then instructions are executed as soon as they
- are decoded. The buffer is filled when instructions are decoded faster
- than they are executed, i.e. when you have unpaired or multi-cycle
- instructions. The FIFO buffer is emptied when instructions execute faster
- than they are decoded, i.e. when you have decoding delays due to prefixes.
- The FIFO buffer is empty after a mispredicted branch. The FIFO buffer can
- receive two instructions per clock cycle provided that the second
- instruction is without prefixes and none of the instructions are longer
- than 7 bytes. The two execution pipelines (U and V) can each receive one
- instruction per clock cycle from the FIFO buffer.
-
-
- Examples:
-
- CLD / REP MOVSD
- The CLD instruction takes two clock cycles and can therefore overshadow
- the decoding delay of the REP prefix. The code would take one clock cycle
- more if the CLD instruction was placed far from the REP MOVSD.
-
- CMP DWORD PTR [EBX],0 / MOV EAX,0 / SETNZ AL
- The CMP instruction takes two clock cycles here because it is a read/modify
- instruction. The 0FH prefix of the SETNZ instruction is decoded during the
- second clock cycle of the CMP instruction, so that the decoding delay is
- hidden.
-
-
- 14. REDUCING CODE SIZE
- ======================
- As explained in paragraph 6, the code cache is 8 kb. If you have problems
- keeping the critical parts of your code within the code cache, then you may
- consider reducing the size of your code.
-
- 32 bit code is usually bigger than 16 bit code because addresses and data
- constants take 4 bytes in 32 bit code and only 2 bytes in 16 bit code.
- However, 16 bit code has other penalties such as prefixes and problems with
- accessing adjacent words simultaneously (see chapter 10). Some other
- methods for reducing the size of your code are discussed below.
-
- Jump addresses, data addresses, and data constants all take less space if
- they can be expressed as a sign-extended byte, i.e. if they are within the
- interval from -128 to +127.
-
- For jump addresses this means that short jumps take two bytes of code,
- whereas jumps beyond 127 bytes take 5 bytes if unconditional and 6 bytes if
- conditional.
-
- Likewise, data addresses take less space if they can be expressed as a
- pointer and a displacement between -128 and +127.
- Example:
- MOV EBX,DS:[100000] / ADD EBX,DS:[100004] ; 12 bytes
- Reduce to:
- MOV EAX,100000 / MOV EBX,[EAX] / ADD EBX,[EAX+4] ; 10 bytes
-
- The advantage of using a pointer obviously increases if you use it many
- times. Storing data on the stack and using EBP or ESP as pointer will thus
- make your code smaller than if you use static memory locations and absolute
- addresses, provided of course that your data are within 127 bytes of the
- pointer. Using PUSH and POP to write and read temporary data is even more
- advantageous.
-
- Data constants may also take less space if they are between -128 and +127.
- Most instructions with immediate operands have a short form where the
- operand is a sign-extended single byte. Examples:
- PUSH 200 ; 5 bytes
- PUSH 100 ; 2 bytes
-
- ADD EBX,128 ; 6 bytes
- SUB EBX,-128 ; 3 bytes
-
- The most important instruction with an immediate operand which doesn't have
- such a short form is MOV. Examples:
- MOV EAX, 1 ; 5 bytes
- XOR EAX,EAX / INC EAX ; 3 bytes
- PUSH 1 / POP EAX ; 3 bytes
-
- If the same constant is used more than once then you may of course load it
- into a register. Example:
- MOV DWORD PTR [EBX],0 / MOV DWORD PTR [EBX+4],0 ; 13 bytes
- XOR EAX,EAX / MOV [EBX],EAX / MOV [EBX+4],EAX ; 7 bytes
-
- You may also consider that different instructions have different lengths.
- The following instructions take only one byte and are therefore very
- attractive: PUSH reg, POP reg, INC reg32, DEC reg32.
- INC and DEC with 8 bit registers take 2 bytes, so INC EAX is shorter than
- INC AL.
-
- XCHG EAX,reg is also a single-byte instruction and thus takes less space
- than MOV EAX,reg, but it is slower and not pairable.
-
- Some instructions take one byte less when they use the accumulator than when
- they use any other register. Examples:
-
- MOV EAX,DS:[100000] is smaller than MOV EBX,DS:[100000]
- ADD EAX,1000 is smaller than ADD EBX,1000
-
- Instructions with pointers take one byte less when they have only a base
- pointer (not ESP) and a displacement than when they have a scaled index
- register, or both base pointer and index register, or ESP as base pointer.
- Examples:
- MOV EAX,[array][EBX] is smaller than MOV EAX,[array][EBX*4]
- MOV EAX,[EBP+12] is smaller than MOV EAX,[ESP+12]
- Instructions with EBP as base pointer and no displacement and no index take
- one byte more than with other registers:
- MOV EAX,[EBX] is smaller than MOV EAX,[EBP], but
- MOV EAX,[EBX+4] is same size as MOV EAX,[EBP+4].
-
-
- 15. SCHEDULING FLOATING POINT CODE
- ==================================
- Floating point instructions cannot pair the way integer instructions can,
- except for one special case, defined by the following rules:
-
- - the first instruction (executing in the U-pipe) must be FLD, FADD, FSUB,
- FMUL, FDIV, FCOM, FCHS, or FABS
-
- - the second instruction (in V-pipe) must be FXCH
-
- - the instruction following the FXCH must be a floating point instruction,
- otherwise the FXCH will take an extra clock cycle.
-
- This special pairing is important, as will be explained shortly.
-
- While floating point instructions in general cannot be paired, many can be
- pipelined, i.e. one instruction can begin before the previous instruction has
- finished. Example:
-
- FADD ST(1),ST(0) ; clock cycle 1-3
- FADD ST(2),ST(0) ; clock cycle 2-4
- FADD ST(3),ST(0) ; clock cycle 3-5
- FADD ST(4),ST(0) ; clock cycle 4-6
-
- Obviously, two instructions cannot overlap if the second instruction needs
- the result of the first. Since almost all floating point instructions
- involve the top of stack register, ST(0), there are seemingly not very many
- possibilities for making an instruction independent of the result of
- previous instructions. The solution to this problem is register renaming.
- The FXCH instruction does not in reality swap the contents of two registers,
- it only swaps their names. Instructions which push or pop the register
- stack also work by renaming. Register renaming has been highly optimized on
- the Pentium so that a register may be renamed while in use. Register
- renaming never causes stalls - it is even possible to rename a register more
- than once in the same clock cycle, as for example when you pair FLD or
- FCOMPP with FXCH.
-
- By the proper use of FXCH instructions you may obtain a lot of overlapping in
- your floating point code. Example:
-
- FLD [a1] ; clock cycle 1
- FADD [a2] ; clock cycle 2-4
- FLD [b1] ; clock cycle 3
- FADD [b2] ; clock cycle 4-6
- FLD [c1] ; clock cycle 5
- FADD [c2] ; clock cycle 6-8
- FXCH ST(2) ; clock cycle 6
- FADD [a3] ; clock cycle 7-9
- FXCH ST(1) ; clock cycle 7
- FADD [b3] ; clock cycle 8-10
- FXCH ST(2) ; clock cycle 8
- FADD [c3] ; clock cycle 9-11
- FXCH ST(1) ; clock cycle 9
- FADD [a4] ; clock cycle 10-12
- FXCH ST(2) ; clock cycle 10
- FADD [b4] ; clock cycle 11-13
- FXCH ST(1) ; clock cycle 11
- FADD [c4] ; clock cycle 12-14
- FXCH ST(2) ; clock cycle 12
-
- In the above example we are interleaving three independent threads. Each
- FADD takes 3 clock cycles, and we can start a new FADD in each clock cycle.
- When we have started an FADD in the 'a' thread we have time to start two new
- FADD instructions in the 'b' and 'c' threads before returning to the 'a'
- thread, so every third FADD belongs to the same thread. We are using FXCH
- instructions every time to get the register that belongs to the desired
- thread into ST(0). As you can see in the example above, this generates a
- regular pattern, but note well that the FXCH instructions repeat with a
- period of two while the threads have a period of three. This can be quite
- confusing, so you have to 'play computer' in order to know which registers
- are where.
-
- All versions of the instructions FADD, FSUB, FMUL, and FILD take 3 clock
- cycles and are able to overlap, so that these instructions may be scheduled
- using the method described above. Using a memory operand does not take more
- time than a register operand if the memory operand is in the level 1 cache
- and properly aligned.
-
- By now you must be used to the rules having exceptions, and the overlapping
- rule is no exception: You cannot start an FMUL instruction one clock cycle
- after another FMUL instruction, because the FMUL circuitry is not perfectly
- pipelined. It is recommended that you put another instruction in between two
- FMUL's. Example:
-
- FLD [a1] ; clock cycle 1
- FLD [b1] ; clock cycle 2
- FLD [c1] ; clock cycle 3
- FXCH ST(2) ; clock cycle 3
- FMUL [a2] ; clock cycle 4-6
- FXCH ; clock cycle 4
- FMUL [b2] ; clock cycle 5-7 (stall)
- FXCH ST(2) ; clock cycle 5
- FMUL [c2] ; clock cycle 7-9 (stall)
- FXCH ; clock cycle 7
- FSTP [a3] ; clock cycle 8-9
- FXCH ; clock cycle 10 (unpaired)
- FSTP [b3] ; clock cycle 11-12
- FSTP [c3] ; clock cycle 13-14
-
- Here you have a stall before FMUL [b2] and before FMUL [c2] because
- another FMUL started in the preceding clock cycle. You can improve this code
- by putting FLD instructions in between the FMUL's:
-
- FLD [a1] ; clock cycle 1
- FMUL [a2] ; clock cycle 2-4
- FLD [b1] ; clock cycle 3
- FMUL [b2] ; clock cycle 4-6
- FLD [c1] ; clock cycle 5
- FMUL [c2] ; clock cycle 6-8
- FXCH ST(2) ; clock cycle 6
- FSTP [a3] ; clock cycle 7-8
- FSTP [b3] ; clock cycle 9-10
- FSTP [c3] ; clock cycle 11-12
-
- In other cases you may put FADD, FSUB, or anything else in between FMUL's to
- avoid the stalls.
-
- Overlapping floating point instructions requires of course that you have
- some independent threads that you can interleave. If you have only one big
- formula to execute, then you may compute parts of the formula in parallel to
- achieve overlapping. If, for example, you want to add six numbers, then you
- may split the operations into two threads with three numbers in each, and
- add the two threads in the end:
-
- FLD [a] ; clock cycle 1
- FADD [b] ; clock cycle 2-4
- FLD [c] ; clock cycle 3
- FADD [d] ; clock cycle 4-6
- FXCH ; clock cycle 4
- FADD [e] ; clock cycle 5-7
- FXCH ; clock cycle 5
- FADD [f] ; clock cycle 7-9 (stall)
- FADD ; clock cycle 10-12 (stall)
-
- Here we have a one clock stall before FADD [f] because it is waiting for
- the result of FADD [d] and a two clock stall before the last FADD because
- it is waiting for the result of FADD [f]. The latter stall can be hidden by
- filling in some integer instructions, but the first stall cannot, because an
- integer instruction at this place would make the FXCH unpairable.
-
- The first stall can be avoided by having three threads rather than two, but
- that would cost an extra FLD so we do not save anything by having three
- threads rather than two unless there are at least eight numbers to add.
-
- Not all floating point instructions can overlap. And some floating point
- instructions can overlap more subsequent integer instructions than
- subsequent floating point instructions. The FDIV instruction, for example,
- takes 39 clock cycles. All but the first clock cycle can overlap with
- integer instructions, but only the last two clock cycles can overlap with
- floating point instructions. Example:
-
- FDIV ; clock cycle 1-39
- FXCH ; clock cycle 1-2
- CMC ; clock cycle 3-4
- RCR EAX,1 ; clock cycle 5
- INC EBX ; clock cycle 5
- FADD [x] ; clock cycle 38-40
- FXCH ; clock cycle 38
- FMUL [y] ; clock cycle 40-42
-
- The first FXCH pairs with the FDIV, but takes an extra clock cycle because
- it is not followed by a floating point instruction. The CMC starts before
- the FDIV is finished, but has to wait for the FXCH to finish. The RCR and
- INC instructions are pairing. The FADD has to wait till clock 38 because new
- floating point instructions can only execute during the last two clock
- cycles of the FDIV. The second FXCH pairs with the FADD. The FMUL has to
- wait for the FDIV to finish because it uses the result of the division.
-
- If you have nothing else to put in after a floating point instruction with a
- large integer overlap, such as FDIV or FSQRT, then you may put in a dummy
- read from an address which you expect to need later in the program to make
- sure it is in the level one cache. Example:
- FDIV QWORD PTR [EBX]
- CMP [ESI],EAX
- FMUL QWORD PTR [ESI]
- Here we use the integer overlap to pre-load the value at [ESI] into the
- cache while the FDIV is being computed (we don't care what the result of the
- CMP is).
-
- Paragraph 21 gives a complete listing of floating point instructions, and
- what they can pair or overlap with.
-
- One floating point instruction deserves special mention, namely FST or
- FSTP with a memory operand. This instruction takes two clock cycles in the
- execution stage, but it seems to start converting the value in ST(0) already
- at the address decode stage in the pipeline, so that there is a one clock
- stall if the value to store is not ready one clock cycle in advance. This is
- analogous to an AGI stall. Example:
-
- FLD [a1] ; clock cycle 1
- FADD [a2] ; clock cycle 2-4
- FLD [b1] ; clock cycle 3
- FADD [b2] ; clock cycle 4-6
- FXCH ; clock cycle 4
- FSTP [a3] ; clock cycle 6-7
- FSTP [b3] ; clock cycle 8-9
-
- The FSTP [a3] stalls for one clock cycle because the result of FADD [a2]
- is not ready in the preceding clock cycle. In many cases you cannot hide
- this type of stall without scheduling your floating point code into four
- threads or putting some integer instructions in between. No other
- instructions have this weirdness. The two clock cycles in the execution
- stage of the FST(P) instruction cannot pair or overlap with any subsequent
- instructions.
-
- Instructions with integer operands such as FIADD, FISUB, FIMUL, FIDIV, FICOM
- may be split up into simpler operations in order to improve overlapping.
- Example:
-
- FILD [a] ; clock cycle 1-3
- FIMUL [b] ; clock cycle 4-9
-
- Split up into:
-
- FILD [a] ; clock cycle 1-3
- FILD [b] ; clock cycle 2-4
- FMUL ; clock cycle 5-7
-
- In this example, you save two clocks by overlapping the two FILD instructions.
-
-
- 16. LOOPS
- =========
- When analyzing a program you often find that 99% of the time consumption
- lies in the innermost loop. The way to improve the speed is to carefully
- optimize the most time-consuming loop using ASM language. The rest of the
- program may be left in high-level language.
-
- A loop generally contains a counter controlling how many times to iterate,
- and often array access reading or writing one array element for each
- iteration. I have chosen as example a procedure which reads integers from an
- array, changes the sign of each integer, and stores the results in another
- array.
-
- In C, the procedure would look like this:
-
- void ChangeSign (int * A, int * B, int N) {
- int i;
- for (i=0; i<N; i++) B[i] = -A[i];}
-
- Translating to assembler, we might write the procedure like this:
-
- Example 1:
-
- _ChangeSign PROC NEAR
- PUSH ESI
- PUSH EDI
- A EQU DWORD PTR [ESP+12]
- B EQU DWORD PTR [ESP+16]
- N EQU DWORD PTR [ESP+20]
-
- MOV ECX, [N]
- JECXZ L2
- MOV ESI, [A]
- MOV EDI, [B]
- CLD
- L1: LODSD
- NEG EAX
- STOSD
- LOOP L1
- L2: POP EDI
- POP ESI
- RET ; (no extra pop if C calling convention)
- _ChangeSign ENDP
-
- This looks like a nice solution, but it is not optimal because it uses slow
- non-pairable instructions. It takes 11 clock cycles per iteration if all
- data are in the level one cache.
-
- Using pairable instructions only
- --------------------------------
-
- Example 2:
-
- MOV ECX, [N]
- MOV ESI, [A]
- TEST ECX, ECX
- JZ SHORT L2
- MOV EDI, [B]
- L1: MOV EAX, [ESI] ; u
- XOR EBX, EBX ; v (pairs)
- ADD ESI, 4 ; u
- SUB EBX, EAX ; v (pairs)
- MOV [EDI], EBX ; u
- ADD EDI, 4 ; v (pairs)
- DEC ECX ; u
- JNZ L1 ; v (pairs)
- L2:
-
- Here we have used pairable instructions only, and scheduled the
- instructions so that everything pairs. It now takes only 4 clock cycles per
- iteration. We could have obtained the same speed without splitting the NEG
- instruction, but the other unpairable instructions should be split up.
-
- Using the same register for counter and index
- ---------------------------------------------
-
- Example 3:
-
- MOV ESI, [A]
- MOV EDI, [B]
- MOV ECX, [N]
- XOR EDX, EDX
- TEST ECX, ECX
- JZ SHORT L2
- L1: MOV EAX, [ESI+4*EDX] ; u
- NEG EAX ; u
- MOV [EDI+4*EDX], EAX ; u
- INC EDX ; v (pairs)
- CMP EDX, ECX ; u
- JB L1 ; v (pairs)
- L2:
-
- Using the same register for counter and index gives us fewer instructions in
- the body of the loop, but it still takes 4 clocks because we have two
- unpaired instructions.
-
- Letting the counter end at zero
- -------------------------------
- We want to get rid of the CMP instruction in example 3 by letting the
- counter end at zero and use the zero flag for detecting when we are finished
- as we did in example 2. One way to do this would be to execute the loop
- backwards taking the last array elements first. However, data caches are
- optimized for accessing data forwards, not backwards, so if cache misses are
- likely, then you should rather start the counter at -N and count through
- negative values up to zero. The base registers should then point to the end
- of the arrays rather than the beginning:
-
- Example 4:
-
- MOV ESI, [A]
- MOV EAX, [N]
- MOV EDI, [B]
- XOR ECX, ECX
- LEA ESI, [ESI+4*EAX] ; point to end of array A
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+4*EAX] ; point to end of array B
- JZ SHORT L2
- L1: MOV EAX, [ESI+4*ECX] ; u
- NEG EAX ; u
- MOV [EDI+4*ECX], EAX ; u
- INC ECX ; v (pairs)
- JNZ L1 ; u
- L2:
-
- We are now down at five instructions in the loop body but it still takes 4
- clocks because of poor pairing. (If the addresses and sizes of the arrays
- are constants we may save two registers by substituting A+SIZE A for ESI and
- B+SIZE B for EDI). Now let's see how we can improve pairing.
-
- Pairing calculations with loop overhead
- ---------------------------------------
- We may want to improve pairing by intermingling calculations with the loop
- control instructions. If we want to put something in between INC ECX and
- JNZ L1, it has to be something that doesn't affect the zero flag. The MOV
- [EDI+4*ECX],EBX instruction after INC ECX would generate an AGI delay,
- so we have to be more ingenious:
-
- Example 5:
-
- MOV EAX, [N]
- XOR ECX, ECX
- SHL EAX, 2 ; 4 * N
- JZ SHORT L3
- MOV ESI, [A]
- MOV EDI, [B]
- SUB ECX, EAX ; - 4 * N
- ADD ESI, EAX ; point to end of array A
- ADD EDI, EAX ; point to end of array B
- JMP SHORT L2
- L1: MOV [EDI+ECX-4], EAX ; u
- L2: MOV EAX, [ESI+ECX] ; v (pairs)
- XOR EAX, -1 ; u
- ADD ECX, 4 ; v (pairs)
- INC EAX ; u
- JNC L1 ; v (pairs)
- MOV [EDI+ECX-4], EAX
- L3:
-
- I have used a different way to calculate the negative of EAX here: inverting
- all bits and adding one. The reason why I am using this method is that I can
- use a dirty trick with the INC instruction: INC doesn't change the carry
- flag, whereas ADD does. I am using ADD rather than INC to increment my loop
- counter and testing the carry flag rather than the zero flag. It is then
- possible to put the INC EAX in between without affecting the carry flag. You
- may think that we could have used LEA EAX,[EAX+1] here instead of
- INC EAX, at least that doesn't change any flags, but the LEA instruction
- would have an AGI stall so that's not the best solution.
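- The negation method used above is simply the two's complement identity:
- inverting all bits and adding one gives the negative. A short C check
- (illustrative only; the function name is made up) confirms it:

```c
#include <assert.h>
#include <stdint.h>

/* -x == ~x + 1 for two's complement integers: this is the identity
   exploited by the XOR EAX,-1 / INC EAX pair in example 5. */
uint32_t neg_via_not_inc(uint32_t x)
{
    return ~x + 1u;
}
```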
-
- I have obtained perfect pairing here and the loop now takes only 3 clock
- cycles. Whether you want to increment the loop counter by 1 (as in example
- 4) or by 4 (as in example 5) is a matter of taste, it makes no difference
- in loop timing.
-
- Overlapping the end of one operation with the beginning of the next
- -------------------------------------------------------------------
- The method used in example 5 is not very generally applicable so we may look
- for other methods of improving pairing opportunities. One way is to
- reorganize the loop so that the end of one operation overlaps with the
- beginning of the next. I will call this convoluting the loop. A convoluted
- loop has an unfinished operation at the end of each loop iteration which
- will be finished in the next run. Actually, example 5 did pair the last MOV
- of one iteration with the first MOV of the next, but we want to explore this
- method further:
-
- Example 6:
-
- MOV ESI, [A]
- MOV EAX, [N]
- MOV EDI, [B]
- XOR ECX, ECX
- LEA ESI, [ESI+4*EAX] ; point to end of array A
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+4*EAX] ; point to end of array B
- JZ SHORT L3
- XOR EBX, EBX
- MOV EAX, [ESI+4*ECX]
- INC ECX
- JZ SHORT L2
- L1: SUB EBX, EAX ; u
- MOV EAX, [ESI+4*ECX] ; v (pairs)
- MOV [EDI+4*ECX-4], EBX ; u
- INC ECX ; v (pairs)
- MOV EBX, 0 ; u
- JNZ L1 ; v (pairs)
- L2: SUB EBX, EAX
- MOV [EDI+4*ECX-4], EBX
- L3:
-
- Here we begin reading the second value before we have stored the first, and
- this of course improves pairing opportunities. The MOV EBX,0 instruction
- has been put in between INC ECX and JNZ L1 not to improve pairing, but to
- avoid an AGI stall.
-
- Rolling out a loop
- ------------------
- The most generally applicable way to improve pairing opportunities is to do
- two operations for each run and do half as many runs. This is called rolling
- out a loop:
-
- Example 7:
-
- MOV ESI, [A]
- MOV EAX, [N]
- MOV EDI, [B]
- XOR ECX, ECX
- LEA ESI, [ESI+4*EAX] ; point to end of array A
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+4*EAX] ; point to end of array B
- JZ SHORT L2
- TEST AL,1 ; test if N is odd
- JZ SHORT L1
- MOV EAX, [ESI+4*ECX] ; N is odd. do the odd one
- NEG EAX
- MOV [EDI+4*ECX], EAX
- INC ECX ; make counter even
- JZ SHORT L2 ; N = 1
- L1: MOV EAX, [ESI+4*ECX] ; u
- MOV EBX, [ESI+4*ECX+4] ; v (pairs)
- NEG EAX ; u
- NEG EBX ; u
- MOV [EDI+4*ECX], EAX ; u
- MOV [EDI+4*ECX+4], EBX ; v (pairs)
- ADD ECX, 2 ; u
- JNZ L1 ; v (pairs)
- L2:
-
- Now we are doing two operations in parallel which gives the best pairing
- opportunities. We have to test if N is odd and if so do one operation
- outside the loop because the loop can only do an even number of operations.
-
- The loop has an AGI stall at the first MOV instruction because ECX has been
- incremented in the preceding clock cycle. The loop therefore takes 6 clock
- cycles for two operations.
-
- Reorganizing a loop to remove AGI stall
- ---------------------------------------
- Example 8:
-
- MOV ESI, [A]
- MOV EAX, [N]
- MOV EDI, [B]
- XOR ECX, ECX
- LEA ESI, [ESI+4*EAX] ; point to end of array A
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+4*EAX] ; point to end of array B
- JZ SHORT L3
- TEST AL,1 ; test if N is odd
- JZ SHORT L2
- MOV EAX, [ESI+4*ECX] ; N is odd. do the odd one
- NEG EAX ; no pairing opportunity
- MOV [EDI+4*ECX], EAX
- INC ECX ; make counter even
- JNZ SHORT L2
- NOP ; add NOP's if JNZ L2 not predictable
- NOP
- JMP SHORT L3 ; N = 1
- L1: NEG EAX ; u
- NEG EBX ; u
- MOV [EDI+4*ECX-8], EAX ; u
- MOV [EDI+4*ECX-4], EBX ; v (pairs)
- L2: MOV EAX, [ESI+4*ECX] ; u
- MOV EBX, [ESI+4*ECX+4] ; v (pairs)
- ADD ECX, 2 ; u
- JNZ L1 ; v (pairs)
- NEG EAX
- NEG EBX
- MOV [EDI+4*ECX-8], EAX
- MOV [EDI+4*ECX-4], EBX
- L3:
-
- The trick is to find a pair of instructions that do not use the loop counter
- as index and reorganize the loop so that the counter is incremented in the
- preceding clock cycle. We are now down at 5 clock cycles for two operations
- which is close to the best possible.
-
- If data caching is critical, then you may improve the speed further by
- combining the A and B arrays into one structured array so that each B[i]
- comes immediately after the corresponding A[i]. If the structured array is
- aligned by at least 8 then B[i] will always be in the same cache line as
- A[i], so you will never have a cache miss when writing B[i]. This may of
- course have a tradeoff in other parts of the program so you have to weigh
- the costs against the benefits.
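- The combined layout can be sketched in C (hypothetical type and function
- names; whether it pays off depends on the rest of the program):

```c
#include <assert.h>

/* A[i] and B[i] interleaved in one structured array. If the array is
   aligned by at least 8, each pair shares a cache line, so writing .b
   never causes an extra cache miss once .a has been read. */
struct pair { int a; int b; };

void change_sign_pairs(struct pair *p, int n)
{
    int i;
    for (i = 0; i < n; i++)
        p[i].b = -p[i].a;
}
```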
-
-
- Rolling out by more than 2
- --------------------------
- You may think of doing more than two operations per iteration in order to
- reduce the loop overhead per operation. But since the loop overhead in most
- cases can be reduced to only one clock cycle per iteration, then rolling out
- the loop by 4 rather than by 2 would only save 1/4 clock cycle per
- operation, which is hardly worth the effort. Only if the loop overhead cannot
- be reduced to one clock cycle and if N is very big, should you think of
- unrolling by 4.
-
- The drawbacks of excessive loop unrolling are:
- 1. You need to calculate N MODULO R, where R is the unrolling factor, and
- do N MODULO R operations before or after the main loop in order to make
- the remaining number of operations divisible by R. This takes a lot of
- extra code and poorly predictable branches. And the loop body of course
- also becomes bigger.
- 2. A piece of code usually takes much more time the first time it executes,
- and the penalty of first time execution is bigger the more code you
- have, especially if N is small.
- 3. Excessive code size makes the utilization of the code cache less
- effective.
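- When the unrolling factor R is a power of two, N MODULO R needs no
- division: it is just the lowest bits of N. In C, the structure of
- example 7 (unrolled by two, with the odd element handled outside the
- loop) looks roughly like this sketch:

```c
#include <assert.h>

/* Negate n integers with a loop unrolled by 2, as in example 7.
   The remainder n MOD 2 is found with a bit test, not a division. */
void change_sign_unrolled(const int *a, int *b, int n)
{
    int i = 0;
    if (n & 1) {             /* n is odd: do the odd one first   */
        b[0] = -a[0];
        i = 1;
    }
    for (; i < n; i += 2) {  /* main loop: two operations per run */
        b[i]     = -a[i];
        b[i + 1] = -a[i + 1];
    }
}
```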
-
- Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers
- -------------------------------------------------------------------------
- If you need to manipulate arrays of 8 or 16 bit operands, then there is a
- problem with unrolled loops because you may not be able to pair two memory
- access operations. For example MOV AL,[ESI] / MOV BL,[ESI+1] will not pair
- if the two operands are within the same dword of memory. But there may be a
- much smarter method, namely to handle four bytes at a time in the same 32
- bit register.
-
- The following example adds 2 to all elements of an array of bytes:
-
- Example 9:
-
- MOV ESI, [A] ; address of byte array
- MOV ECX, [N] ; number of elements in byte array
- TEST ECX, ECX ; test if N is 0
- JZ SHORT L2
- MOV EAX, [ESI] ; read first four bytes
- L1: MOV EBX, EAX ; copy into EBX
- AND EAX, 7F7F7F7FH ; get lower 7 bits of each byte in EAX
- XOR EBX, EAX ; get the highest bit of each byte in EBX
- ADD EAX, 02020202H ; add desired value to all four bytes
- XOR EBX, EAX ; combine bits again
- MOV EAX, [ESI+4] ; read next four bytes
- MOV [ESI], EBX ; store result
- ADD ESI, 4 ; increment pointer
- SUB ECX, 4 ; decrement loop counter
- JA L1 ; loop
- L2:
-
- This loop takes 5 clock cycles for every 4 bytes. The array should of course
- be aligned by 4. If the number of elements in the array is not divisible by
- four, then you may pad it in the end with a few extra bytes to make the
- length divisible by four. This loop will always read past the end of the
- array, so you should make sure the array is not placed at the end of a
- segment to avoid a general protection error.
-
- Note that I have masked out the highest bit of each byte to avoid a possible
- carry from each byte into the next when adding. I am using XOR rather than
- ADD when putting in the high bit again to avoid carry.
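- The masking trick can be expressed compactly in C (a sketch with a
- made-up function name):

```c
#include <assert.h>
#include <stdint.h>

/* Add 2 to each of the four bytes packed in a 32 bit word, as in
   example 9, without carries propagating between the bytes. */
uint32_t add2_bytewise(uint32_t x)
{
    uint32_t low  = x & 0x7F7F7F7Fu;  /* lower 7 bits of each byte */
    uint32_t high = x ^ low;          /* highest bit of each byte  */
    low += 0x02020202u;               /* per-byte sums stay below 256,
                                         so no carry crosses a byte */
    return low ^ high;                /* put the high bits back; XOR
                                         rather than ADD avoids carry */
}
```

- Each byte wraps modulo 256 independently, exactly as four separate byte
- additions would.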
-
- The ADD ESI,4 instruction could have been avoided by using the loop
- counter as index as in example 4. However, this would give an odd number of
- instructions in the loop body, so there would be one unpaired instruction
- and the loop would still take 5 clocks. Making the branch instruction
- unpaired would save one clock after the last operation when the branch is
- mispredicted, but we would have to spend an extra clock cycle in the prolog
- code to setup a pointer to the end of the array and calculate -N, so the two
- methods will be exactly equally fast. The method presented here is the
- simplest and shortest.
-
- The next example finds the length of a zero-terminated string by searching
- for the first byte of zero. It is faster than using REP SCASB:
-
- Example 10:
-
- STRLEN PROC NEAR
- MOV EAX,[ESP+4] ; get pointer
- MOV EDX,7
- ADD EDX,EAX ; pointer+7 used in the end
- PUSH EBX
- MOV EBX,[EAX] ; read first 4 bytes
- ADD EAX,4 ; increment pointer
- L1: LEA ECX,[EBX-01010101H] ; subtract 1 from each byte
- XOR EBX,-1 ; invert all bytes
- AND ECX,EBX ; and these two
- MOV EBX,[EAX] ; read next 4 bytes
- ADD EAX,4 ; increment pointer
- AND ECX,80808080H ; test all sign bits
- JZ L1 ; no zero bytes, continue loop
- TEST ECX,00008080H ; test first two bytes
- JNZ SHORT L2
- SHR ECX,16 ; not in the first 2 bytes
- ADD EAX,2
- L2: SHL CL,1 ; use carry flag to avoid a branch
- POP EBX
- SBB EAX,EDX ; compute length
- RET ; (or RET 4 if pascal)
- STRLEN ENDP
-
- Again we have used the method of overlapping the end of one operation with
- the beginning of the next to improve pairing. I have not unrolled the loop
- because it is likely to repeat relatively few times. The string should of
- course be aligned by 4. The code will always read past the end of the
- string, so the string should not be placed at the end of a segment.
-
- The loop body has an odd number of instructions so there is one unpaired.
- Making the branch instruction unpaired rather than one of the other
- instructions has the advantage that it saves 1 clock cycle when the branch
- is mispredicted.
-
- The TEST ECX,00008080H instruction is non-pairable. You could use the
- pairable instruction OR CH,CL here instead, but then you would have to
- put in a NOP or something to avoid the penalties of consecutive branches.
- Another problem with OR CH,CL is that it would cause a partial register
- stall on a 486 or PentiumPro processor. So I have chosen to keep the
- unpairable TEST instruction.
-
- Handling 4 bytes simultaneously can be quite difficult. The code uses a
- formula which generates a nonzero value for a byte if, and only if, the
- byte is zero. This makes it possible to test all four bytes in one
- operation. This algorithm involves the subtraction of 1 from all bytes (in
- the LEA instruction). I have not masked out the highest bit of each byte
- before subtracting, as I did in the previous example, so the subtraction may
- generate a borrow to the next byte, but only if it is zero, and this is
- exactly the situation where we don't care what the next byte is, because we
- are searching forwards for the first zero. If we were searching backwards
- then we would have to re-read the dword after detecting a zero, and then
- test all four bytes to find the last zero, or use BSWAP to reverse the order
- of the bytes.
-
- If you want to search for a byte value other than zero, then you may XOR all
- four bytes with the value you are searching for, and then use the method
- above to search for zero.
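- The zero-byte formula from the STRLEN loop, and its XOR generalization
- to arbitrary byte values, can be written in C like this (the function
- names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if, and only if, at least one byte of x is zero: the
   (x - 01010101H) AND NOT x AND 80808080H formula used in STRLEN. */
uint32_t has_zero_byte(uint32_t x)
{
    return (x - 0x01010101u) & ~x & 0x80808080u;
}

/* Search for an arbitrary byte value: XOR turns a matching byte into
   zero, then the zero-byte test above applies. */
uint32_t has_byte(uint32_t x, uint8_t value)
{
    uint32_t mask = value * 0x01010101u;  /* value repeated 4 times */
    return has_zero_byte(x ^ mask);
}
```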
-
- Handling multiple operands in the same register is easier on the MMX
- processors because they have special instructions and special 64 bit
- registers for this purpose. Using the MMX instructions has a high penalty if
- you are using floating point instructions shortly afterwards, so there may
- still be situations where you want to use 32 bit registers as in the
- examples above.
-
-
- Loops with floating point operations
- ------------------------------------
- The methods of optimizing floating point loops are basically the same as for
- integer loops, although the floating point instructions are overlapping
- rather than pairing.
-
- Consider the C language code:
-
- void DAXPY (double * X, double * Y, double DA, int n) {
-   int i;
-   for (i=0; i<n; i++) Y[i] = Y[i] - DA * X[i];}
-
- This piece of code (called DAXPY) has been studied extensively because it is
- the key to solving linear equations.
-
- Example 11:
-
- DSIZE = 8 ; data size
- MOV EAX, [N] ; number of elements
- MOV ESI, [X] ; pointer to X
- MOV EDI, [Y] ; pointer to Y
- XOR ECX, ECX
- LEA ESI, [ESI+DSIZE*EAX] ; point to end of X
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+DSIZE*EAX] ; point to end of Y
- JZ SHORT L3 ; test for N = 0
- FLD DSIZE PTR [DA]
- FMUL DSIZE PTR [ESI+DSIZE*ECX] ; DA * X[0]
- JMP SHORT L2 ; jump into loop
- L1: FLD DSIZE PTR [DA]
- FMUL DSIZE PTR [ESI+DSIZE*ECX] ; DA * X[i]
- FXCH ; get old result
- FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store Y[i]
- L2: FSUBR DSIZE PTR [EDI+DSIZE*ECX] ; subtract from Y[i]
- INC ECX ; increment index
- JNZ L1 ; loop
- FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store last result
- L3:
-
- Here we are using the same methods as in example 6: Using the loop counter
- as index register and counting through negative values up to zero. The end
- of one operation overlaps with the beginning of the next.
-
- The interleaving of floating point operations works perfectly here: The 2
- clock stall between FMUL and FSUBR is filled with the FSTP of the previous
- result. The 3 clock stall between FSUBR and FSTP is filled with the loop
- overhead and the first two instructions of the next operation. An AGI stall
- has been avoided by reading the only parameter that doesn't depend on the
- index in the first clock cycle after the index has been incremented.
-
- This solution takes 6 clock cycles per operation, which is better than the
- unrolled solution published by Intel!
-
- Unrolling floating point loops
- ------------------------------
- The DAXPY loop unrolled by 3 is quite complicated:
-
- Example 12:
- DSIZE = 8 ; data size
- IF DSIZE EQ 4
- SHIFTCOUNT = 2
- ELSE
- SHIFTCOUNT = 3
- ENDIF
-
- MOV EAX, [N] ; number of elements
- MOV ECX, 3*DSIZE ; counter bias
- SHL EAX, SHIFTCOUNT ; DSIZE*N
- JZ L4 ; N = 0
- MOV ESI, [X] ; pointer to X
- SUB ECX, EAX ; (3-N)*DSIZE
- MOV EDI, [Y] ; pointer to Y
- SUB ESI, ECX ; point to end of X, minus bias
- SUB EDI, ECX
- TEST ECX, ECX
- FLD DSIZE PTR [ESI+ECX] ; first X
- JNS SHORT L2 ; less than 4 operations
- L1: ; main loop
- FMUL DSIZE PTR [DA]
- FLD DSIZE PTR [ESI+ECX+DSIZE]
- FMUL DSIZE PTR [DA]
- FXCH
- FSUBR DSIZE PTR [EDI+ECX]
- FXCH
- FLD DSIZE PTR [ESI+ECX+2*DSIZE]
- FMUL DSIZE PTR [DA]
- FXCH
- FSUBR DSIZE PTR [EDI+ECX+DSIZE]
- FXCH ST(2)
- FSTP DSIZE PTR [EDI+ECX]
- FSUBR DSIZE PTR [EDI+ECX+2*DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+ECX+DSIZE]
- FLD DSIZE PTR [ESI+ECX+3*DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
- ADD ECX, 3*DSIZE
- JS L1 ; loop
- L2: FMUL DSIZE PTR [DA] ; finish leftover operation
- FSUBR DSIZE PTR [EDI+ECX]
- SUB ECX, 2*DSIZE ; change pointer bias
- JZ SHORT L3 ; finished
- FLD DSIZE PTR [DA] ; start next operation
- FMUL DSIZE PTR [ESI+ECX+3*DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
- FSUBR DSIZE PTR [EDI+ECX+3*DSIZE]
- ADD ECX, 1*DSIZE
- JZ SHORT L3 ; finished
- FLD DSIZE PTR [DA]
- FMUL DSIZE PTR [ESI+ECX+3*DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
- FSUBR DSIZE PTR [EDI+ECX+3*DSIZE]
- ADD ECX, 1*DSIZE
- L3: FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
- L4:
-
- The reason why I am showing you how to unroll a loop by 3 is not to
- recommend it, but to warn you how difficult it is! Be prepared to spend a
- considerable amount of time debugging and verifying your code when doing
- something like this. There are several problems to take care of: In most
- cases, you cannot remove all stalls from a floating point loop unrolled by
- less than 4 unless you convolute it (i.e. there are unfinished operations at
- the end of each run which are being finished in the next run). The last FLD
- in the main loop above is the beginning of the first operation in the next
- run. It would be tempting here to make a solution which reads past the end
- of the array and then discards the extra value in the end, as in example 9
- and 10, but that is not recommended in floating point loops because the
- reading of the extra value might generate a denormal operand exception in
- case the memory position after the array doesn't contain a valid floating
- point number. To avoid this, we have to do at least one more operation after
- the main loop.
-
- The number of operations to do outside an unrolled loop would normally be
- N MODULO R, where N is the number of operations, and R is the unrolling
- factor. But in the case of a convoluted loop, we have to do one more, i.e.
- (N-1) MODULO R + 1, for the abovementioned reason.
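- The (N-1) MODULO R + 1 count always lies between 1 and R, so the epilog
- code is guaranteed at least one operation and the remaining count stays
- divisible by R. A trivial C check (illustrative only):

```c
#include <assert.h>

/* Leftover operations for a convoluted loop unrolled by r:
   (n-1) % r + 1 is always in the range 1..r, leaving n minus the
   leftover divisible by r for the main loop. */
int leftover(int n, int r)
{
    return (n - 1) % r + 1;
}
```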
-
- Normally, we would prefer to do the extra operations before the main loop,
- but here we have to do them afterwards for two reasons: One reason is to
- take care of the leftover operand from the convolution. The other reason is
- that calculating the number of extra operations requires a division if R is
- not a power of 2, and a division is time consuming. Doing the extra
- operations after the loop saves the division.
-
- The next problem is to calculate how to bias the loop counter so that it
- will change sign at the right time, and adjust the base pointers so as to
- compensate for this bias. Finally, you have to make sure the leftover
- operand from the convolution is handled correctly for all values of N.
-
- The epilog code doing 1-3 operations could have been implemented as a
- separate loop, but that would cost an extra branch misprediction, so the
- solution above is faster.
-
- Now that I have scared you by demonstrating how difficult it is to unroll by
- 3, I will show you that it is much easier to unroll by 4:
-
- Example 13:
-
- DSIZE = 8 ; data size
- MOV EAX, [N] ; number of elements
- MOV ESI, [X] ; pointer to X
- MOV EDI, [Y] ; pointer to Y
- XOR ECX, ECX
- LEA ESI, [ESI+DSIZE*EAX] ; point to end of X
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+DSIZE*EAX] ; point to end of Y
- TEST AL,1 ; test if N is odd
- JZ SHORT L1
- FLD DSIZE PTR [DA] ; do the odd operation
- FMUL DSIZE PTR [ESI+DSIZE*ECX]
- FSUBR DSIZE PTR [EDI+DSIZE*ECX]
- INC ECX ; adjust counter
- FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE]
- L1: TEST AL,2 ; test for possibly 2 more operations
- JZ L2
- FLD DSIZE PTR [DA] ; N MOD 4 = 2 or 3. Do two more
- FMUL DSIZE PTR [ESI+DSIZE*ECX]
- FLD DSIZE PTR [DA]
- FMUL DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
- FXCH
- FSUBR DSIZE PTR [EDI+DSIZE*ECX]
- FXCH
- FSUBR DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+DSIZE*ECX]
- FSTP DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
- ADD ECX, 2 ; counter is now divisible by 4
- L2: TEST ECX, ECX
- JZ L4 ; no more operations
- L3: ; main loop:
- FLD DSIZE PTR [DA]
- FLD DSIZE PTR [ESI+DSIZE*ECX]
- FMUL ST,ST(1)
- FLD DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
- FMUL ST,ST(2)
- FLD DSIZE PTR [ESI+DSIZE*ECX+2*DSIZE]
- FMUL ST,ST(3)
- FXCH ST(2)
- FSUBR DSIZE PTR [EDI+DSIZE*ECX]
- FXCH ST(3)
- FMUL DSIZE PTR [ESI+DSIZE*ECX+3*DSIZE]
- FXCH
- FSUBR DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
- FXCH ST(2)
- FSUBR DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
- FXCH
- FSUBR DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
- FXCH ST(3)
- FSTP DSIZE PTR [EDI+DSIZE*ECX]
- FSTP DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
- FSTP DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
- FSTP DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
- ADD ECX, 4 ; increment index by 4
- JNZ L3 ; loop
- L4:
-
- It is usually quite easy to find a stall-free solution when unrolling by 4,
- and there is no need for convolution. The number of extra operations to do
- outside the main loop is N MODULO 4, which can be calculated easily without
- division, simply by testing the two lowest bits in N. The extra operations
- are done before the main loop rather than after, to make the handling of the
- loop counter simpler.
-
- The tradeoff of loop unrolling is that the extra operations outside the loop
- are slower due to incomplete overlapping and possible branch mispredictions,
- and the first time penalty is higher because of increased code size.
-
- As a general recommendation, I would say that if N is big or if convoluting
- the loop without unrolling cannot remove enough stalls, then you should
- unroll critical integer loops by 2 and floating point loops by 4.
-
-
- 17. DISCUSSION OF SPECIAL INSTRUCTIONS
- ======================================
-
- 17.1 LEA
- --------
- The LEA instruction is useful for many purposes because it can do a shift,
- two additions, and a move in just one pairable instruction taking one clock
- cycle. Example:
- LEA EAX,[EBX+8*ECX-1000]
- is much faster than
- MOV EAX,ECX / SHL EAX,3 / ADD EAX,EBX / SUB EAX,1000
- The LEA instruction can also be used to do an add or shift without changing
- the flags. The source and destination need not have the same word size, so
- LEA EAX,[BX] is a useful replacement for MOVZX EAX,BX.
-
- You must be aware, however, that the LEA instruction will suffer an AGI
- stall if it uses a base or index register which has been changed in the
- preceding clock cycle.
-
- Since the LEA instruction is pairable in the V-pipe and shift instructions
- are not, you may use LEA as a substitute for a SHL by 1, 2, or 3 if you want
- the instruction to execute in the V-pipe.
-
- The 32 bit processors have no documented addressing mode with a scaled index
- register and nothing else, so an instruction like LEA EAX,[EAX*2] is
- actually coded as LEA EAX,[EAX*2+00000000] with an immediate displacement
- of 4 bytes. You may reduce the instruction size by instead writing LEA
- EAX,[EAX+EAX] or even better ADD EAX,EAX. The latter code cannot have an
- AGI delay. If you happen to have a register which is zero (like a loop
- counter after a loop), then you may use it as a base register to reduce the
- code size:
-
- LEA EAX,[EBX*4] ; 7 bytes
- LEA EAX,[ECX+EBX*4] ; 3 bytes
-
-
- 17.2 MOV [MEM],ACCUM
- --------------------
- The instructions MOV [mem],AL MOV [mem],AX MOV [mem],EAX
- are treated by the pairing circuitry as if they were writing to the
- accumulator. Thus the following instructions do not pair:
- MOV [mydata], EAX
- MOV EBX, EAX
-
- This problem occurs only with the short form of the MOV instruction which
- can not have a base or index register, and which can only have the
- accumulator as source. You can avoid the problem by using another register,
- by reordering your instructions, by using a pointer, or by hard-coding the
- general form of the MOV instruction.
-
- In 32 bit mode you can write the general form of MOV [mem],EAX:
- DB 89H, 05H
- DD OFFSET DS:[mem]
-
- In 16 bit mode you can write the general form of MOV [mem],AX:
- DB 89H, 06H
- DW OFFSET DS:[mem]
-
- To use AL instead of (e)AX, you replace 89H with 88H.
-
-
- 17.3 TEST
- ---------
- The TEST instruction with an immediate operand is only pairable if the
- destination is AL, AX, or EAX.
-
- TEST register,register and TEST register,memory is always pairable.
-
- Examples:
- TEST ECX,ECX ; pairable
- TEST [mem],EBX ; pairable
- TEST EDX,256 ; not pairable
- TEST DWORD PTR [EBX],8000H ; not pairable
- To make it pairable, use any of the following methods:
- MOV EAX,[EBX] / TEST EAX,8000H
- MOV EDX,[EBX] / AND EDX,8000H
- MOV AL,[EBX+1] / TEST AL,80H
- MOV AL,[EBX+1] / TEST AL,AL ; (result in sign flag)
- It is also possible to test a bit by shifting it into the carry flag:
- MOV EAX,[EBX] / SHR EAX,16 ; (result in carry flag)
- but this method has a penalty on the PentiumPro when the shift count is more
- than one.
-
- (The reason for this non-pairability is probably that the first byte of the
- 2-byte instruction is the same as for some other non-pairable instructions,
- and the Pentium cannot afford to check the second byte too when determining
- pairability.)
-
-
- 17.4 XCHG
- ---------
- The XCHG register,memory instruction is dangerous. This instruction
- always has an implicit LOCK prefix which prevents it from using the
- cache. It is therefore very time consuming and should always be
- avoided.
-
-
- 17.5 rotates through carry
- --------------------------
- RCR and RCL with a count different from one are slow and should be avoided.
-
-
- 17.6 string instructions
- ------------------------
- String instructions without a repeat prefix are too slow, and should always
- be replaced by simpler instructions. The same applies to LOOP and JECXZ.
-
- String instructions with repeat may be optimal. Always use the dword version
- if possible, and make sure that both source and destination are aligned by 4.
-
- REP MOVSD is the fastest way to move blocks of data when the destination is
- in the level 1 cache. See section 19 for an alternative.
- Note that while the REP MOVSD instruction writes a DWORD to the destination,
- it reads the next DWORD from the source in the same clock cycle. According
- to rule 10.3 above, you have a cache bank conflict if bits 2-4 are the same
- in these two addresses. In other words, you will get a penalty of one clock
- extra per iteration if ESI+4-EDI is divisible by 32. This can be avoided if
- both source and destination are aligned by 8.
-
- REP STOSD is optimal when the destination is in the cache.
-
- REP LODS, REP SCAS, and REP CMPS are not optimal, and may be replaced by
- loops. See section 16 example 10 for an alternative to REP SCASB.
-
-
- 17.7 bit scan
- -------------
- BSF and BSR are among the most poorly optimized instructions on the
- Pentium, taking approximately 11 + 2*n clock cycles, where n is the number
- of zeros skipped. (On later processors they take only 1 or 2 clock cycles.)
-
- The following code emulates BSR ECX,EAX:
- TEST EAX,EAX
- JZ SHORT BS1
- MOV DWORD PTR [TEMP],EAX
- MOV DWORD PTR [TEMP+4],0
- FILD QWORD PTR [TEMP]
- FSTP QWORD PTR [TEMP]
- WAIT ; WAIT only needed for compatibility with earlier processors
- MOV ECX, DWORD PTR [TEMP+4]
- SHR ECX,20
- SUB ECX,3FFH
- TEST EAX,EAX ; clear zero flag
- BS1:
-
- The following code emulates BSF ECX,EAX:
- TEST EAX,EAX
- JZ SHORT BS2
- XOR ECX,ECX
- MOV DWORD PTR [TEMP+4],ECX
- SUB ECX,EAX
- AND EAX,ECX
- MOV DWORD PTR [TEMP],EAX
- FILD QWORD PTR [TEMP]
- FSTP QWORD PTR [TEMP]
- WAIT ; WAIT only needed for compatibility with earlier processors
- MOV ECX, DWORD PTR [TEMP+4]
- SHR ECX,20
- SUB ECX,3FFH
- TEST EAX,EAX ; clear zero flag
- BS2:
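-
- The same exponent-extraction trick can be sketched in portable C (an
- illustration only; the helper names emu_bsr and emu_bsf are my own, and
- IEEE 754 doubles are assumed):
-
```c
#include <stdint.h>
#include <string.h>

/* Emulate BSR: index of the highest set bit of a nonzero x.
   Converting x to double puts that index, biased by 3FFH (1023),
   in the exponent field (bits 52-62), just as the FILD/FSTP
   sequence above leaves it in [TEMP+4]. */
int emu_bsr(uint32_t x)
{
    double d = (double)x;
    uint64_t bits;
    memcpy(&bits, &d, 8);                      /* reinterpret the double */
    return (int)((bits >> 52) & 0x7FF) - 1023; /* shift by 52, SUB 3FFH */
}

/* Emulate BSF: isolate the lowest set bit with x & -x first
   (the XOR/SUB/AND steps above), then scan as for BSR. */
int emu_bsf(uint32_t x)
{
    return emu_bsr(x & (0u - x));
}
```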
-
-
- 17.8 bit test
- -------------
- BT, BTC, BTR, and BTS instructions should preferably be replaced by
- instructions like TEST, AND, OR, XOR, or shifts.
-
-
- 17.9 integer multiplication
- ---------------------------
- An integer multiplication takes approximately 9 clock cycles. It is
- therefore advantageous to replace a multiplication by a constant with a
- combination of other instructions such as SHL, ADD, SUB, and LEA. Example:
- IMUL EAX,10
- can be replaced with
- MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX
- or
- LEA EAX,[EAX+4*EAX] / ADD EAX,EAX
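-
- The decomposition is easy to check in C (a sketch; mul10 and mul10_lea
- are hypothetical names):
-
```c
#include <stdint.h>

/* x*10 = x*2 + x*8: the MOV/ADD/SHL/ADD sequence above computes
   EAX = 2*x and EBX = 8*x, then adds them. */
uint32_t mul10(uint32_t x)
{
    return (x << 1) + (x << 3);
}

/* The LEA variant: EAX = x + 4*x = 5*x, then doubled. */
uint32_t mul10_lea(uint32_t x)
{
    return (x + 4 * x) * 2;
}
```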
-
- Floating point multiplication is faster than integer multiplication on a
- Pentium without MMX, but the time used to convert integers to float and
- convert the product back again is usually more than the time saved by using
- floating point multiplication, except when the number of conversions is low
- compared with the number of multiplications.
-
-
- 17.10 division
- --------------
- Division is quite time consuming. The DIV instruction takes 17, 25, or 41
- clock cycles for byte, word, and dword divisors respectively. The IDIV
- instruction takes 5 clock cycles more. It is therefore preferable to use the
- smallest operand size possible that won't generate an overflow, even if it
- costs an operand size prefix, and use unsigned division if possible.
-
- Unsigned division by a power of two can be done with SHR. Division of a
- signed number by a power of two can be done with SAR, but the result with
- SAR is rounded towards minus infinity, whereas the result with IDIV is
- truncated towards zero.
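-
- The difference can be demonstrated in C (a sketch; C's / operator
- truncates towards zero like IDIV, and >> on a negative int is assumed to
- be an arithmetic shift like SAR, which x86 compilers provide although the
- C standard leaves it implementation-defined):
-
```c
/* Division by 4 three ways.  SAR rounds towards minus infinity,
   IDIV truncates towards zero; adding 2^n - 1 to a negative dividend
   before shifting makes the shift match IDIV. */
int div4_sar(int x)  { return x >> 2; }    /* rounds towards -infinity */
int div4_idiv(int x) { return x / 4; }     /* truncates towards zero   */

int div4_fixed(int x)
{
    if (x < 0)
        x += 3;                            /* 2^2 - 1 */
    return x >> 2;
}
```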
-
- Dividing a positive number N (< 2^16) by a constant D which is not a
- power of two can be done by multiplying N by 2^16 / D and then dividing
- by 2^16. For example, to divide a number by 5, you may multiply by
- 2^16 / 5 = 3333H, and then divide by 2^16:
-
- INC EAX / IMUL EAX,3333H / SHR EAX,16
-
- The INC EAX is added to compensate for the double rounding error.
- This method works for 0 <= N <= 2^16. You have to test the code thoroughly
- to make sure the rounding is correct.
-
- If you want to replace the IMUL with faster pairable instructions, then
- you may write:
-
- LEA EBX,[EAX+2*EAX+3]
- LEA ECX,[EAX+2*EAX+3]
- SHL EBX,4
- MOV EAX,ECX
- SHL ECX,8
- ADD EAX,EBX
- SHL EBX,8
- ADD EAX,ECX
- ADD EAX,EBX
- SHR EAX,16
-
- You can use the same code to divide by 10. Just shift right by 17
- instead of 16 in the end.
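-
- Both variants can be checked in C (a sketch; div5 and div10 are
- hypothetical names):
-
```c
#include <stdint.h>

/* Divide by 5: multiply by 3333H (an approximation of 2^16 / 5)
   and shift right by 16.  The +1 is the INC EAX compensating for
   the double rounding error. */
uint32_t div5(uint32_t n)            /* valid for 0 <= n <= 2^16 */
{
    return ((n + 1) * 0x3333u) >> 16;
}

/* Divide by 10: same multiplier (3333H ~ 2^17 / 10), shift by 17. */
uint32_t div10(uint32_t n)           /* valid for 0 <= n <= 2^16 */
{
    return ((n + 1) * 0x3333u) >> 17;
}
```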
-
-
- Floating point division takes 39 clock cycles for the highest precision.
- You can save time by specifying a lower precision in the floating point
- control word (only FDIV and FIDIV are faster at low precision, no other
- instructions can be sped up this way).
-
- It is possible to do a floating point division and an integer division in
- parallel to save time. Example: A = A1 / A2; B = B1 / B2
-
- FILD [B1]
- FILD [B2]
- MOV EAX,[A1]
- MOV EBX,[A2]
- CDQ
- FDIV
- DIV EBX
- FISTP [B]
- MOV [A],EAX
- (make sure you set the floating point control word to the desired
- rounding method)
-
- Obviously, you should always try to minimize the number of divisions.
- Floating point division with a constant or repeated division with the same
- value should of course be done by multiplying with the reciprocal.
- But there are many other situations where you can reduce the number of
- divisions. For example:
- if (A/B > C)... can be rewritten as if (A > B*C)... when B is
- positive, and the opposite when B is negative.
-
- A/B + C/D can be rewritten as (A*D + C*B) / (B*D)
-
- If you are using integer division, then you should be aware that the
- rounding errors may be different when you rewrite the formulas.
-
-
- 17.11 WAIT
- ----------
- You can often increase speed by omitting the WAIT instruction.
- The WAIT instruction has three functions:
-
- a. The 8087 processor requires a WAIT before _every_ floating point
- instruction to make sure the coprocessor is ready to receive it.
-
- b. WAIT is used to coordinate memory access between the floating point
- unit and the integer unit. Examples:
- b.1. FISTP [mem32]
- WAIT ; wait for floating point unit to write before..
- MOV EAX,[mem32] ; reading the result with the integer unit
-
- b.2. FILD [mem32]
- WAIT ; wait for floating point unit to read value..
- MOV [mem32],EAX ; before overwriting it with integer unit
-
- b.3. FLD QWORD PTR [ESP]
- WAIT ; prevent an accidental hardware interrupt from..
- ADD ESP,8 ; overwriting value on stack before it is read
-
- c. WAIT is sometimes used to check for exceptions. It will generate an
- interrupt if an unmasked exception bit in the floating point status
- word has been set by a preceding floating point instruction.
-
- Regarding a:
- The function in point a is never needed on any processor other than the
- old 8087. Unless you want your code to be compatible with the 8087 you
- should tell your assembler not to put in these WAITs by specifying a
- higher processor. An 8087 floating point emulator also inserts WAIT
- instructions.
- You should therefore tell your assembler not to generate emulation code
- unless you need it.
-
- Regarding b:
- WAIT instructions to coordinate memory access are definitely needed on the
- 8087 and 80287 but not on the Pentium. It is not quite clear whether it is
- needed on the 80387 and 80486. I have made several tests on these Intel
- processors and have not been able to provoke any error by omitting the
- WAIT on any 32 bit Intel processor, although Intel manuals say the WAIT is
- needed for this purpose except after FNSTSW and FNSTCW. Omitting WAIT
- instructions for coordinating memory access is not 100% safe, even when
- writing 32 bit code, because the code may be able to run on an 80386 main
- processor with a 287 coprocessor, which requires the WAIT. Also, I have not
- tested all possible hardware and software combinations, so there may be
- other situations where the WAIT is needed.
-
- If you want to be certain that your code will work on _any_ 32 bit processor
- (including non-Intel processors) then I would recommend that you include the
- WAIT here in order to be safe.
-
- Regarding c:
- The assembler automatically inserts a WAIT for this purpose before the
- following instructions: FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW.
- You can omit the WAIT by writing FNCLEX, etc. My tests show that the WAIT
- is unnecessary in most cases because these instructions without WAIT will
- still generate an interrupt on exceptions except for FNCLEX and FNINIT on
- the 80387. (There is some inconsistency about whether the IRET from the
- interrupt points to the FN.. instruction or to the next instruction).
-
- Almost all other floating point instructions will also generate an interrupt
- if a previous floating point instruction has set an unmasked exception bit,
- so the exception is likely to be detected sooner or later anyway. You may
- insert a WAIT after the last floating point instruction in your program to
- be sure to catch all exceptions.
-
- You may still need the WAIT if you want to know exactly where an exception
- occurred in order to be able to recover from the situation. Consider, for
- example, the code under b.3 above: If you want to be able to recover from
- an exception generated by the FLD here, then you need the WAIT because an
- interrupt after ADD ESP,8 would overwrite the value to load.
-
-
- 17.12 FCOM + FSTSW AX
- ---------------------
- The usual way of doing floating point comparisons is:
- FLD [a]
- FCOMP [b]
- FSTSW AX
- SAHF
- JB ASmallerThanB
-
- You may improve this code by using FNSTSW AX rather than FSTSW AX and
- test AH directly rather than using the non-pairable SAHF.
- (TASM version 3.0 has a bug with the FNSTSW AX instruction)
-
- FLD [a]
- FCOMP [b]
- FNSTSW AX
- SHR AH,1
- JC ASmallerThanB
-
- Testing for zero or equality:
-
- FTST
- FNSTSW AX
- AND AH,40H
- JNZ IsZero ; (the zero flag is inverted!)
-
- Test if greater:
-
- FLD [a]
- FCOMP [b]
- FNSTSW AX
- AND AH,41H
- JZ AGreaterThanB
-
- Do not use TEST AH,41H as it is not pairable. Do not use TEST EAX,4100H as
- it would produce a partial register stall on the PentiumPro. Do not test the
- flags after multibit shifts, as this has a penalty on the PentiumPro.
-
- The FNSTSW instruction delays for 4 clocks after any floating point
- instruction (probably because it is waiting for the status word to retire
- from the pipeline). If you can fill this latency with integer instructions
- between FCOM and FNSTSW, then the FNSTSW instruction will take only 2
- clocks, rather than 6. Make sure you use the version without WAIT prefix in
- this case.
-
- It is sometimes faster to use integer instructions for comparing floating
- point values, as described in section 18 below.
-
-
- 17.13 freeing floating point registers
- --------------------------------------
- You have to free all used floating point registers before exiting a
- subroutine, except for any register used for the result.
-
- The fastest way to free one register is FSTP ST.
- The fastest way to free two registers is FCOMPP.
- It is not recommended to use FFREE.
-
-
- 17.14 FISTP
- -----------
- Converting a floating point number to integer is normally done like this:
- FISTP DWORD PTR [TEMP]
- MOV EAX, [TEMP]
-
- An alternative method is:
-
- .DATA
- ALIGN 8
- TEMP DQ ?
- MAGIC DD 59C00000H ; f.p. representation of 2^51 + 2^52
-
- .CODE
- FADD [MAGIC]
- FSTP QWORD PTR [TEMP]
- MOV EAX, DWORD PTR [TEMP]
-
- Adding the 'magic number' of 2^51 + 2^52 has the effect that any integer
- between -2^31 and +2^31 will be aligned in the lower 32 bits when storing
- as a double precision floating point number. The result is the same as you
- get with FISTP for all rounding methods except truncation towards zero.
- The result is different from FISTP if the control word specifies truncation
- or in case of overflow. You may need a WAIT instruction for compatibility
- with other processors. See section 17.11.
-
- This method is not faster than using FISTP, but it gives better scheduling
- opportunities because there is a 3 clock void between FADD and FSTP which
- may be filled with other instructions. You may multiply or divide the number
- by a power of 2 in the same operation by doing the opposite to the magic
- number. You may also add a constant by adding it to the magic number,
- which then has to be double precision.
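-
- The magic number method can be sketched in C (an illustration under the
- stated assumptions: round-to-nearest mode, IEEE 754 doubles, and a
- little-endian memory layout so the low dword comes first; fast_round is
- a hypothetical name):
-
```c
#include <stdint.h>
#include <string.h>

/* Adding 2^51 + 2^52 aligns the integer part of x in the lower 32
   bits of the double's bit pattern; reading back that dword gives
   the rounded integer, matching FISTP under round-to-nearest. */
int32_t fast_round(double x)             /* requires |x| < 2^31 */
{
    double d = x + 6755399441055744.0;   /* 2^51 + 2^52 */
    int32_t lo;
    memcpy(&lo, &d, 4);                  /* low dword, little-endian */
    return lo;
}
```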
-
-
- 17.15 FPTAN
- -----------
- According to the manuals, FPTAN returns two values X and Y and leaves it to
- the programmer to divide Y by X to get the result, but in fact it always
- returns 1 in X so you can save the division. My tests show that on all 32
- bit Intel processors with floating point unit or coprocessor, FPTAN always
- returns 1 in X regardless of the argument. If you want to be absolutely sure
- that your code will run correctly on all processors, then you may test if X
- is 1, which is faster than dividing by X. The Y value may be very high,
- but never infinity, so you don't have to test if Y contains a valid number
- if you know that the argument is valid.
-
-
-
- 18. USING INTEGER INSTRUCTIONS TO DO FLOATING POINT OPERATIONS
- ==============================================================
- Integer instructions are generally faster than floating point instructions,
- so it is often advantageous to use integer instructions for doing simple
- floating point operations. The most obvious example is moving data. Example:
-
- FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]
-
- Change to:
-
- MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX
-
- The former code takes 4 clocks, the latter takes 2.
-
- Testing if a floating point value is zero:
-
- The floating point value of zero is usually represented as 32 or 64 bits of
- zero, but there is a pitfall here: The sign bit may be set! Minus zero is
- regarded as a valid floating point number, and the processor may actually
- generate a zero with the sign bit set if for example multiplying a negative
- number with zero. So if you want to test if a floating point number is zero,
- you should not test the sign bit. Example:
-
- FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero
-
- Use integer instructions instead, and shift out the sign bit:
-
- MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero
-
- The former code takes 9 clocks, the latter takes only 2.
- If the floating point number is double precision (QWORD) then you only have
- to test bits 32-62. If they are zero, then the lower half will also be zero
- if it is a normal floating point number.
-
- Testing if negative:
- A floating point number is negative if the sign bit is set and at least one
- other bit is set. Example:
- MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative
-
- Manipulating the sign bit:
- You can change the sign of a floating point number simply by flipping the
- sign bit. Example:
- XOR BYTE PTR [a] + (TYPE a) - 1, 80H
-
- Likewise you may get the absolute value of a floating point number by simply
- ANDing out the sign bit.
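-
- These bit tests translate directly to C (a sketch for single precision;
- the helper names are my own, and memcpy stands in for the plain integer
- load that the assembly uses):
-
```c
#include <stdint.h>
#include <string.h>

/* Read the bit pattern of a float (assumes IEEE 754). */
uint32_t float_bits(float f)
{
    uint32_t b;
    memcpy(&b, &f, 4);
    return b;
}

/* Zero test: shift out the sign bit, as ADD EAX,EAX does above,
   so both +0.0 and -0.0 count as zero. */
int float_is_zero(float f) { return (float_bits(f) << 1) == 0; }

/* Negative test: sign bit set and at least one other bit set,
   i.e. the pattern is above 80000000H as an unsigned number. */
int float_is_neg(float f)  { return float_bits(f) > 0x80000000u; }

/* Change sign by flipping the sign bit (absolute value would
   instead clear it with AND). */
float float_chs(float f)
{
    uint32_t b = float_bits(f) ^ 0x80000000u;
    float r;
    memcpy(&r, &b, 4);
    return r;
}
```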
-
- Comparing numbers:
- Floating point numbers are stored in a unique format which allows you to use
- integer instructions for comparing floating point numbers, except for the
- sign bit. If you are certain that two floating point numbers both are normal
- and positive then you may simply compare them as integers. Example:
-
- FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB
-
- Change to:
-
- MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB
-
- This method only works if the two numbers have the same precision and you
- are certain that neither of the numbers has the sign bit set.
-
- If negative numbers are possible, then you have to convert the
- negative numbers to two's complement, and do a signed compare:
-
- MOV EAX,[a]
- MOV EBX,[b]
- MOV ECX,EAX
- MOV EDX,EBX
- SAR ECX,31 ; copy sign bit
- AND EAX,7FFFFFFFH ; remove sign bit
- SAR EDX,31
- AND EBX,7FFFFFFFH
- XOR EAX,ECX          ; make two's complement if sign bit was set
- XOR EBX,EDX
- SUB EAX,ECX
- SUB EBX,EDX
- CMP EAX,EBX
- JL ASmallerThanB ; signed comparison
-
- This method works for all normal floating point numbers, including -0.
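-
- The transform can be written out in C (a sketch; float_key is a
- hypothetical name for the value the SAR/AND/XOR/SUB sequence above leaves
- in EAX and EBX, and >> on a negative int is assumed to be an arithmetic
- shift, as on x86 compilers):
-
```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to a signed integer with the same
   ordering: copy the sign (SAR), strip the sign bit (AND), then
   conditionally negate (XOR/SUB = two's complement when negative). */
int32_t float_key(float f)
{
    int32_t b;
    memcpy(&b, &f, 4);               /* assumes IEEE 754 */
    int32_t sign = b >> 31;          /* 0 for positive, -1 for negative */
    b &= 0x7FFFFFFF;                 /* remove sign bit */
    return (b ^ sign) - sign;        /* negate if sign was set */
}

/* a < b as floats exactly when the keys compare as signed integers;
   works for all normal numbers, including -0. */
int float_less(float a, float b) { return float_key(a) < float_key(b); }
```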
-
-
- 19. USING FLOATING POINT INSTRUCTIONS TO DO INTEGER OPERATIONS
- ==============================================================
- 19.1 Moving data
- ----------------
- Floating point instructions can be used to move 8 bytes at a time:
- FILD QWORD PTR [ESI] / FISTP QWORD PTR [EDI]
- This is only an advantage if the destination is not in the cache. The
- optimal way to move a block of data to uncached memory on the Pentium is:
-
- TopOfLoop:
- FILD QWORD PTR [ESI]
- FILD QWORD PTR [ESI+8]
- FXCH
- FISTP QWORD PTR [EDI]
- FISTP QWORD PTR [EDI+8]
- ADD ESI,16
- ADD EDI,16
- DEC ECX
- JNZ TopOfLoop
-
- The source and destination should of course be aligned by 8. The extra time
- used by the slow FILD and FISTP instructions is compensated for by the fact
- that you only have to do half as many write operations. Note that this
- method is only advantageous on the Pentium and only if the destination is
- not in the level 1 cache. On all other processors the optimal way to move
- blocks of data is REP MOVSD, or if you have a processor with MMX you may use
- the MMX instructions instead to write 8 bytes at a time.
-
- 19.2 Integer multiplication
- ---------------------------
- Floating point multiplication is faster than integer multiplication on the
- Pentium without MMX, but the price for converting integer factors to float
- and converting the result back to integer is high, so floating point
- multiplication is only advantageous if the number of conversions needed is
- low compared with the number of multiplications. Integer multiplication is
- faster than floating point on other processors.
-
- It may be tempting to use denormal floating point operands to save some of
- the conversions here, but the handling of denormals is very slow, so this
- is not a good idea!
-
- 19.3 Integer division
- ---------------------
- Floating point division is not faster than integer division, but you can do
- other integer operations (including integer division, but not integer
- multiplication) while the floating point unit is working on the division.
- See paragraph 17.10 above for an example.
-
- 19.4 Converting binary to decimal numbers
- -----------------------------------------
- Using the FBSTP instruction is a simple and convenient way of converting a
- binary number to decimal, although not necessarily the fastest method.
-
-
- 20. LIST OF INTEGER INSTRUCTIONS
- ================================
- Explanations:
- Operands: r=register, m=memory, i=immediate data, sr=segment register
- m32= 32 bit memory operand, etc.
-
- Clock cycles:
- The numbers are minimum values. Cache misses, misalignment, and exceptions
- may increase the clock counts considerably.
-
- Pairability:
- u=pairable in U-pipe, v=pairable in V-pipe, uv=pairable in either pipe,
- np=not pairable
-
- Opcode Operands Clock cycles Pairability
- ----------------------------------------------------------------------------
- NOP 1 uv
- MOV r/m, r/m/i 1 uv
- MOV r/m, sr 1 np
- MOV sr , r/m >= 2 b) np
- MOV m , accum 1 uv h)
- XCHG (E)AX, r 2 np
- XCHG r , r 3 np
- XCHG r , m >20 np
- XLAT 4 np
- PUSH r/i 1 uv
- POP r 1 uv
- PUSH m 2 np
- POP m 3 np
- PUSH sr 1 b) np
- POP sr >= 3 b) np
- PUSHF 4 np
- POPF 6 np
- PUSHA POPA 5 np
- LAHF SAHF 2 np
- MOVSX MOVZX r, r/m 3 a) np
- LEA r/m 1 uv
- LDS LES LFS LGS LSS m 4 c) np
- ADD SUB AND OR XOR r , r/i 1 uv
- ADD SUB AND OR XOR r , m 2 uv
- ADD SUB AND OR XOR m , r/i 3 uv
- ADC SBB r , r/i 1 u
- ADC SBB r , m 2 u
- ADC SBB m , r/i 3 u
- CMP r , r/i 1 uv
- CMP m , r/i 2 uv
- TEST r , r 1 uv
- TEST m , r 2 uv
- TEST r , i 1 f)
- TEST m , i 2 np
- INC DEC r 1 uv
- INC DEC m 3 uv
- NEG NOT r/m 1/3 np
- MUL IMUL r8/r16/m8/m16 11 np
- MUL IMUL all other versions 9 d) np
- DIV r8/r16/r32 17/25/41 np
- IDIV r8/r16/r32 22/30/46 np
- CBW CWDE 3 np
- CWD CDQ 2 np
- SHR SHL SAR SAL r , i 1 u
- SHR SHL SAR SAL m , i 3 u
- SHR SHL SAR SAL r/m, CL 4/5 np
- ROR ROL RCR RCL r/m, 1 1/3 u
- ROR ROL r/m, i(><1) 1/3 np
- ROR ROL r/m, CL 4/5 np
- RCR RCL r/m, i(><1) 8/10 np
- RCR RCL r/m, CL 7/9 np
- SHLD SHRD r, i/CL 4 a) np
- SHLD SHRD m, i/CL 5 a) np
- BT r, r/i 4 a) np
- BT m, i 4 a) np
- BT m, r 9 a) np
- BTR BTS BTC r, r/i 7 a) np
- BTR BTS BTC m, i 8 a) np
- BTR BTS BTC m, r 14 a) np
- BSF BSR r , r/m 7-73 a) np
- SETcc r/m 1/2 a) np
- JMP CALL short/near 1/4 e) v
- JMP CALL far >= 3 e) np
- conditional jump short/near 1/4/5 e) v
- CALL JMP r/m 2/5 e) np
- RETN 2/5 e) np
- RETN i 3/6 e) np
- RETF 4/7 e) np
- RETF i 5/8 e) np
- J(E)CXZ short 5-8 e) np
- LOOP short 5-9 e) np
- BOUND r , m 8 np
- CLC STC CMC CLD STD 2 np
- CLI STI 6-7 np
- LODS 2 np
- REP LODS 7+3*n g) np
- STOS 3 np
- REP STOS 10+n g) np
- MOVS 4 np
- REP MOVSB 12+1.8*n g) np
- REP MOVSW 12+1.5*n g) np
- REP MOVSD 12+n g) np
- SCAS 4 np
- REP(N)E SCAS 9+4*n g) np
- CMPS 5 np
- REP(N)E CMPS 8+5*n g) np
- BSWAP 1 a) np
- CPUID 13/15/16 a) np
- RDTSC 6 a) np
- ----------------------------------------------------------------------------
- Notes:
- a) this instruction has a 0FH prefix which takes one clock cycle extra to
- decode on a Pentium without MMX unless preceded by a multicycle
- instruction (see section 13 above).
- b) versions with FS and GS have a 0FH prefix. see note a.
- c) versions with SS, FS, and GS have a 0FH prefix. see note a.
- d) versions with two operands and no immediate have a 0FH prefix, see note a.
- e) see section 12 above
- f) only pairable if register is accumulator. see paragraph 17.3 above
- g) add one clock cycle for decoding the repeat prefix unless preceded by a
- multicycle instruction (such as CLD. see section 13 above).
- h) pairs as if it were writing to the accumulator. see paragraph 17.2 above.
-
-
- 21. LIST OF FLOATING POINT INSTRUCTIONS
- =======================================
- Explanations:
- Operands: r=register, m=memory, m32=32 bit memory operand, etc.
-
- Clock cycles:
- The numbers are minimum values. Cache misses, misalignment, denormal
- operands, and exceptions may increase the clock counts considerably.
-
- Pairability:
- +=pairable with FXCH, np=not pairable with FXCH.
-
- i-ov:
- Overlap with integer instructions. i-ov = 4 means that the last four clock
- cycles can overlap with subsequent integer instructions.
-
- fp-ov:
- Overlap with floating point instructions. fp-ov = 2 means that the last two
- clock cycles can overlap with subsequent floating point instructions.
- (WAIT is considered a floating point instruction here)
-
- Opcode Operand Clock cycles Pairability i-ov fp-ov
- -----------------------------------------------------------------------------
- FLD r/m32/m64 1 + 0 0
- FLD m80 3 np 0 0
- FBLD m80 48-58 np 0 0
- FST(P) r 1 np 0 0
- FST(P) m32/m64 2 i) np 0 0
- FST(P) m80 3 i) np 0 0
- FBSTP m80 148-154 np 0 0
- FILD m 3 np 2 2
- FIST(P) m 6 np 0 0
- FLDZ FLD1 2 np 0 0
- FLDPI FLDL2E etc. 5 np 0 0
- FNSTSW AX/m16 6 n) np 0 0
- FLDCW m16 8 np 0 0
- FNSTCW m16 2 np 0 0
-
- FADD(P) r/m 3 + 2 2
- FSUB(R)(P) r/m 3 + 2 2
- FMUL(P) r/m 3 + 2 2 j)
- FDIV(R)(P) r/m 19/33/39 m) + k) 38 l) 2
- FCHS FABS 1 + 0 0
- FCOM(P)(P) FUCOM r/m 1 + 0 0
- FIADD FISUB(R) m 6 np 2 2
- FIMUL m 6 np 2 2
- FIDIV(R) m 22/36/42 m) np 38 l) 2
- FICOM m 4 np 0 0
- FTST 1 np 0 0
- FXAM 17-21 np 4 0
- FPREM 16-64 np 2 2
- FPREM1 20-70 np 2 2
- FRNDINT 9-20 np 0 0
- FSCALE 20-32 np 5 0
- FXTRACT 12-66 np 0 0
-
- FSQRT 70 np 69 l) 2
- FSIN FCOS 16-126 np 2 2
- FSINCOS 17-137 np 2 2
- F2XM1 13-57 np 2 2
- FYL2X 22-111 np 2 2
- FYL2XP1 22-103 np 2 2
- FPATAN 19-134 np 2 2
- FPTAN 17-173 np 36 l) 0
-
- FNOP 2 np 0 0
- FXCH r 1 np 0 0
- FINCSTP FDECSTP 2 np 0 0
- FFREE r 2 np 0 0
- FNCLEX 6-9 np 0 0
- FNINIT 12-22 np 0 0
- FNSAVE m 124-300 np 0 0
- FRSTOR m 70-95 np 0 0
- WAIT 1 np 0 0
- -----------------------------------------------------------------------------
- Notes:
- i) The value to store is needed one clock cycle in advance.
- j) 1 if the overlapping instruction is also an FMUL.
- k) If the FXCH is followed by an integer instruction then it will pair
- imperfectly and take an extra clock cycle so that the integer
- instruction will begin in clock cycle 3.
- l) Cannot overlap integer multiplication instructions.
- m) FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision
- respectively. FIDIV takes 3 clocks more. The precision is defined by bit
- 8-9 of the floating point control word.
- n) The first 4 clock cycles can overlap with preceding integer instructions.
- See section 17.12 above.
-
-
- 22. TESTING SPEED
- =================
- The Pentium has an internal 64 bit clock counter which can be read into
- EDX:EAX using the instruction RDTSC (read time stamp counter). This is very
- useful for testing exactly how many clock cycles a piece of code takes.
-
- The program below is useful for measuring the number of clock cycles a piece
- of code takes. The program executes the code to test 10 times and stores the
- 10 clock counts. The program can be used in both 16 and 32 bit mode.
-
- RDTSC MACRO ; define RDTSC instruction
- DB 0FH,31H
- ENDM
-
- ITER EQU 10 ; number of iterations
-
- .DATA ; data segment
- ALIGN 4
- COUNTER DD 0 ; loop counter
- TICS DD 0 ; temporary storage of clock
- RESULTLIST DD ITER DUP (0) ; list of test results
-
- .CODE ; code segment
- BEGIN: MOV [COUNTER],0 ; reset loop counter
- TESTLOOP: ; test loop
- ;**************** Do any initializations here: ************************
- FINIT
- ;**************** End of initializations ************************
- RDTSC ; read clock counter
- MOV [TICS],EAX ; save count
- CLD ; non-pairable filler
- REPT 8
- NOP ; eight NOP's to avoid shadowing effect
- ENDM
-
- ;**************** Put instructions to test here: ************************
- FLDPI ; this is only an example
- FSQRT
- RCR EBX,10
- FSTP ST
- ;********************* End of instructions to test ************************
-
- CLC ; non-pairable filler with shadow
- RDTSC ; read counter again
- SUB EAX,[TICS] ; compute difference
- SUB EAX,15 ; subtract the clock cycles used by fillers etc
- MOV EDX,[COUNTER] ; loop counter
- MOV [RESULTLIST][EDX],EAX ; store result in table
- ADD EDX,TYPE RESULTLIST ; increment counter
- MOV [COUNTER],EDX ; store counter
- CMP EDX,ITER * (TYPE RESULTLIST)
- JB TESTLOOP ; repeat ITER times
-
- ; insert here code to read out the values in RESULTLIST
-
-
- The 'filler' instructions before and after the piece of code to test are
- critical. The CLD is a non-pairable instruction which has been inserted to
- make sure the pairing is the same the first time as the subsequent times.
- The eight NOP instructions are inserted to prevent any prefixes in the code
- to test from being decoded in the shadow of the preceding instructions. Single
- byte instructions are used here to obtain the same pairing the first time as
- the subsequent times. The CLC after the code to test is a non-pairable
- instruction which has a shadow under which the 0FH prefix of the RDTSC can
- be decoded so that it is independent of any shadowing effect from the code
- to test.
-
- On the PentiumPro you have to put in a serializing instruction like CPUID
- before and after each RDTSC to prevent it from executing in parallel with
- anything else. CPUID has a decoding delay on the Pentium without MMX because
- of the 0FH prefix, but it has no shadow because it is serializing.
-
- The RDTSC instruction cannot execute in virtual mode, so if you are running
- under DOS you must skip the EMM386 (or any other memory manager) in your
- CONFIG.SYS and not run under a DOS box in Windows.
-
- The Pentium processor has special performance monitor counters which can
- count events such as cache misses, misalignments, AGI stalls, etc. Details
- about how to use the performance monitor counters are not covered by this
- manual and must be sought elsewhere (see chapter 2).
-
-
- 23. CONSIDERATIONS FOR OTHER MICROPROCESSORS
- ============================================
- Most of the optimizations described in this document have little or no
- negative effects on other microprocessors, including non-Intel processors,
- but there are some problems to be aware of.
-
- Using a full register after writing to part of the register will cause a
- moderate delay on the 80486 and a severe delay on the PentiumPro and
- PentiumII. This is called a partial register stall. Example:
- MOV AL,[EBX] / MOV ECX,EAX
- You may avoid this penalty by zeroing the full register first:
- XOR EAX,EAX / MOV AL,[EBX] / MOV ECX,EAX
- or by using MOVZX EAX,BYTE PTR [EBX]
- A similar penalty occurs when you address the same piece of memory with
- different data sizes. Example:
- MOV [ESI],EAX / MOV BL,[ESI]
-
- Scheduling floating point code for the Pentium often requires a lot of extra
- FXCH instructions. This will slow down execution on earlier microprocessors,
- but not on the PentiumPro and advanced non-Intel processors.
-
- The MMX instructions available on the PentiumMMX and PentiumII processors
- are very useful for massively parallel integer calculations.
-
- The PentiumPro chip is faster than the Pentium in some respects, but
- inferior in other respects. Knowing the strong and weak sides of the
- PentiumPro can help you make your code work well on both processors.
-
- The most important advantage of the PentiumPro is that it does much of the
- optimization for you: reordering instructions and splitting complex
- instructions into simple ones. But for perfectly optimized code there is
- less difference between the two processors.
-
- The two processors have basically the same number of execution units, so
- the throughput should be near the same. The PPro has separate units for
- memory read and write so that it can do three operations simultaneously
- if one of them is a memory read, but on the other hand it cannot do two
- memory reads or two writes simultaneously as the Pentium can.
-
- The PPro is better than the Pentium in the following respects:
-
- - out of order execution
-
- - splitting complex instructions into smaller micro-ops
-
- - one cache miss does not delay subsequent independent instructions
-
- - automatic register renaming to avoid unnecessary dependencies
-
- - better jump prediction method than Pentium without MMX
-
- - many instructions which are unpairable and poorly optimized on the
- Pentium perform better on the PPro, e.g. integer multiplication,
- movzx, cdq, bit scan, bit test, shifts by cl, and floating point store
-
- - floating point instructions and simple integer instructions can execute
- simultaneously
-
- - memory reads and writes do not occupy the ALU's
-
- - indirect memory read instructions have no AGI stall
-
- - new conditional move instructions can be used instead of branches in
- some cases
-
- - new FCOMI instruction eliminates the need for the slow FNSTSW AX
-
- The PPro is inferior to the Pentium in the following respects:
-
- - mispredicted jumps are very expensive (10-15 clock cycles!)
-
- - poor performance on 16 bit code and segmented models
-
- - prefixes are expensive (except 0FH extended opcode)
-
- - long stall when mixing 8, 16, and 32 bit registers
-
- - fadd, fsub, fmul, fchs have longer latency
-
- - cannot do two memory reads or two memory writes simultaneously
-
- - some instruction combinations cannot execute in parallel, like
- push+push, push+call, compare+conditional jump
-
- As a consequence of this, the PentiumPro may actually be slower than the
- Pentium on perfectly optimized code with many unpredictable branches, or
- on floating point code with little or no natural parallelism.
- In most other cases the PPro is faster.
-
- Most of the drawbacks of each processor can be circumvented by careful
- optimization and running in 32 bit flat mode. But the problem with mispredicted
- jumps on the PPro cannot be avoided except in the cases where you can use
- a conditional move instead.
-
- Taking advantage of the new instructions in the MMX and PentiumPro
- processors will create problems if you want your code to be compatible
- with earlier microprocessors. The solution may be to write several
- versions of your code, each optimized for a particular processor. Your
- program should automatically detect which processor it is running on and
- select the appropriate version of code. Such a complicated approach is of
- course only needed for the most critical parts of your program.
-
-